Abstract

Provenance is information about the origin, derivation, ownership, 
or history of an object. It has recently been studied extensively 
in scientific databases and other settings due to its importance in 
helping scientists judge data validity, quality and integrity. How- 
ever, most models of provenance have been stated as ad hoc defini- 
tions motivated by informal concepts such as "comes from", "influ- 
ences", "produces", or "depends on". These models lack clear for- 
malizations describing in what sense the definitions capture these 
intuitive concepts. This makes it difficult to compare approaches, 
evaluate their effectiveness, or argue about their validity. 

We introduce provenance traces, a general form of provenance 
for the nested relational calculus (NRC), a core database query 
language. Provenance traces can be thought of as concrete data 
structures representing the operational semantics derivation of a 
computation; they are related to the traces that have been used 
in self-adjusting computation, but differ in important respects. We 
define a tracing operational semantics for NRC queries that pro- 
duces both an ordinary result and a trace of the execution. We show 
that three pre-existing forms of provenance for the NRC can be 
extracted from provenance traces. Moreover, traces satisfy two
semantic guarantees: consistency, meaning that the traces describe
what actually happened during execution, and fidelity, meaning that
the traces "explain" how the expression would behave if the input
were changed. These guarantees are much stronger than those
contemplated for previous approaches to provenance; thus, provenance
traces provide a general semantic foundation for comparing and 
unifying models of provenance in databases. 

1. Introduction 

Sophisticated computer systems and programming techniques, par- 
ticularly database management systems and distributed computa- 
tion, are now being used for large-scale scientific endeavors in 
many fields including biology, physics and astronomy. Moreover,
they are used directly by scientists who, often justifiably, view
the behavior of such systems as opaque and unreliable. Simply
presenting the result of a computation is not considered sufficient to
establish its repeatability or scientific value in (for example) a jour- 
nal article. Instead, it is considered essential to provide high-level 
explanations of how a part of the result of a database query or dis- 
tributed computation was derived from its inputs, or how a database 
came to be the way it is. Such information about the source, con- 
text, derivation, or history of a (data) object is often called prove- 
nance. 

Currently, many systems either require their users to deal with 
provenance manually or provide one of a variety of ad hoc, cus- 
tom solutions. Manual recordkeeping is tedious and error-prone, 
while both manual and custom solutions are expensive and provide 
few formal correctness guarantees. This state of affairs strongly 



motivates research into automatic and standardized techniques for 
recording, managing, and exploiting provenance in databases and 
other systems. 

A number of approaches to automatic provenance tracking
have been studied, each aiming to capture some intuitive aspect
of provenance such as "Where did a result come from in the
input?" (Buneman et al. 2001), "What inputs influenced a
result?" (Cui et al. 2000; Buneman et al. 2001), "How was a result
produced from the input?" (Green et al. 2007), or "What inputs
do results depend on?" (Cheney et al. 2007). However, there is
not yet much understanding of the advantages, disadvantages and 
formal guarantees offered by each, or of the relationships among 
them. Many of these techniques have been presented as ad hoc 
definitions without clear formal specifications of the problem the 
definitions are meant to solve. In some cases, loose specifications 
have been developed, but they appear difficult to extend beyond 
simple settings such as monotone relational queries. 

Therefore, we believe that semantic foundations for provenance 
need to be developed in order to understand and relate existing tech- 
niques, as well as to motivate and validate new techniques. We fo- 
cus on provenance in database management systems, because of 
its practical importance and because several interesting provenance 
techniques have already been developed in this setting. We inves- 
tigate a semantic foundation for provenance in databases based on 
traces. We begin with an operational semantics based on stores in 
which each part of each value has a label. We instrument the seman- 
tics so that as an expression evaluates, we record certain properties 
of the operational derivation in a provenance trace. Provenance 
traces record the relationships between the labels in the store, ul- 
timately linking the result of a computation to the input. Traces can 
be viewed as a concrete representation of the operational semantics 
derivation showing how each part of the output was computed from 
the input and intermediate values. 

We employ the nested relational calculus (NRC), a core database
query language closely related to monadic comprehensions as used
in Haskell and other functional programming languages (Wadler
1992). The nested relational model also forms the basis for
distributed programming systems such as MapReduce (Dean and
Ghemawat 2008) and PigLatin (Olston et al. 2008) and is closely
related to XML. Thus, our results should generalize to these other
settings.

This paper makes the following contributions: 

• We define traces, traced evaluation for NRC queries, and a trace 
adaptation semantics. 

• We show that we can extract several other forms of provenance
that have been developed for the NRC from traces, including
where-provenance (Buneman et al. 2001, 2007), dependency
provenance (Cheney et al. 2007), and semiring-provenance (Green
et al. 2007; Foster et al. 2008). The semiring-provenance model
already generalizes several other forms of provenance such as
why-provenance (Buneman et al. 2001) and lineage (Cui et al.
2000), but where-provenance and dependency-provenance are
not instances of the semiring model. Provenance traces thus
unify three previously unrelated provenance models.
• We state and prove properties which establish traces as a solid 
semantic foundation for provenance. Specifically, we show that 
the trace generated by evaluating an expression is consistent 
with the resulting store, and that such traces are "explanations" 
that help us understand how the expression would behave if the 
input store is changed. This is the main contribution of the pa- 
per, and in particular the explanation property is a key "correct- 
ness" property for provenance that has been absent from previ- 
ous work on this topic. 

We want to emphasize that provenance traces are not a proposal 
for a concrete, practical form of provenance. Traces are a candidate 
answer to the question "what is the most detailed form of prove- 
nance we could imagine recording?" We expect that it is unlikely 
that provenance traces would be implementable within a large-scale 
database system. Other practical provenance techniques will nec- 
essarily sacrifice or approximate some of the detail of provenance 
traces in return for efficiency. Thus, the role of provenance traces is 
to provide a way to explain precisely what is lost in the process. 

The traces used in this paper are also related to traces studied in
other settings, particularly in AFL, an adaptive functional language
introduced by Acar et al. (2006). However, there are important
differences. First, while AFL leaves it up to the programmer to
identify modifiable inputs and changeable outputs, provenance traces
implicitly treat every part of the input as modifiable and every part 
of the output as changeable. This may make provenance traces too 
inefficient for practical use, but our main goal here is to identify a 
rich, principled form of provenance and efficiency is a secondary 
concern. Second, AFL traces are based directly on source lan- 
guage expressions, and were not designed with human-readability 
or provenance extraction in mind. In contrast, provenance traces 
can be viewed as directed acyclic graphs (with some extra struc- 
ture and annotations) that can easily be traversed to extract other 
forms of provenance. Finally, AFL includes user-defined, recursive 
functions, whereas the NRC does not include function definitions 
but does provide collection types and comprehension operations. 
These differences are minor; it appears straightforward to add the 
missing features to the respective languages. 

An example As a simple example, consider an expression if x =
5 then y + 42 else x. If we run this on an input store $x = 5^{l_x}$,
$y = 42^{l_y}$ then the result is $84^{l'}$, and the trace is

l_1' <- l_x = 5;
cond(l_1', t, l' <- l_y+42)

This trace records that we first test whether $l_x = 5$, then do a
conditional branch. The cond trace records the tested label $l'_1$, its
value, and a subtrace showing how we computed the final result $l'$
from $l_y$.

As a more complicated example illustrating traces for relational 
operations, consider a SQL-style query that selects only the B- 
values of records in table R: 

SELECT B FROM R 

which corresponds to the NRC expression $\{\pi_B(x) \mid x \in R\}$.
When run on $R = \{(A : 1, B : 2), (A : 2, B : 3)\}$ the result is
$\{2, 3\}$. If we regard the input as labeled as follows:
$\{(A : 1^{l_{11}}, B : 2^{l_{12}})^{l_1}, (A : 2^{l_{21}}, B : 3^{l_{22}})^{l_2}\}^{l}$
then the resulting trace is

l' <- comp(l,{[l_1] l_1' <- proj_B (l_1,l_12),
[l_2] l_2' <- proj_B (l_2,l_22)})

producing labeled output $\{2^{l'_1}, 3^{l'_2}\}^{l'}$. This trace shows that the
result is obtained by comprehension over $l$. There are two elements,
$l_1$ and $l_2$, yielding results $l'_1 = l_{12}$ and $l'_2 = l_{22}$, which were
obtained by projecting the B field from $l_1$ and $l_2$ respectively.

It should be clear that traces can in general be large and difficult 
to interpret because they are very low-level. As mentioned above, 
we can slice traces by discarding irrelevant information to obtain 
smaller traces that are more useful as explanations of how a specific 
part of the output was produced or how a part of the input was used. 
As a simple example, if we are only interested in $l'_1$ in the output of
the second example, we can slice the trace "backwards" from $l'_1$ to
obtain

l' <- comp(l,{[l_1] l_1' <- proj_B (l_1,l_12)},
x. \pi_B(x))

Dually, if we wish to see how some part of the input influences parts
of the output, we can slice "forwards". For example, the forward
slice from $l_{21}$ is empty, meaning that it did not play any role in the
execution, whereas a forward slice from $l_{22}$ is

l' <- comp(l,{[l_2] l_2' <- proj_B (l_2,l_22)},
x. \pi_B(x))

We can also extract other forms of provenance directly from
traces. For example, in the second query above, we can see that $l'_1$
in the output "comes from" $l_{12}$ in the input since it is copied by the
projection operation $l'_1 \leftarrow \mathrm{proj}_B(l_1, l_{12})$. Similarly, if we inspect
the forward trace slice from $l_{22}$, we can see that the labels $l'_2$ and
$l'$ in the output may "depend on" $l_{22}$, and that the edge $(l', l'_2)$ is
"produced" by the comprehension from the edge $(l, l_2)$.

Synopsis The structure of the rest of this paper is as follows.
Section 2 reviews the nested relational calculus, and introduces an
operational, destination-passing, store-based semantics for NRC.
Section 3 defines provenance traces and introduces a traced operational
semantics for NRC queries and a trace adaptation semantics for
adjusting traces to changes to the input. Section 5 establishes the
key metatheoretic and semantic properties of traces. Section 4
discusses extracting other forms of provenance from traces, and
Section 6 briefly discusses trace slicing and simplification techniques.
We discuss related and future work and conclude in Sections 7 and 8.

2. Nested relational calculus 

The nested relational calculus (Buneman et al.), or NRC, is
a simply-typed core language, closely related to monadic
comprehensions (Wadler 1992). The NRC is as expressive as standard
database query languages such as SQL but has simpler syntax and
cleaner semantics. (We do not address certain dark corners of SQL
such as NULL values.) The syntax of NRC types $\tau \in \mathrm{Type}$ is as
follows:

$$\tau ::= \mathrm{int} \mid \mathrm{bool} \mid \tau_1 \times \tau_2 \mid \{\tau\}$$

Types include base types such as int and bool, pairing types $\tau_1 \times \tau_2$,
and collection types $\{\tau\}$. Collection types $\{\tau\}$ are often taken to
be sets, bags (multisets), or lists; in this paper, we consider multiset
collections only. We omit first-class function types and $\lambda$-terms
because most database systems do not support them.

We assume countably infinite, disjoint sets Var of variables and
Lab of labels. The syntax of NRC expressions $e \in \mathrm{Exp}$ is as follows:

$$
\begin{array}{rcl}
e & ::= & l \mid x \mid \mathrm{let}\ x = e_1\ \mathrm{in}\ e_2 \mid (e_1, e_2) \mid \pi_i(e) \\
  & \mid & b \mid \neg e \mid e_1 \wedge e_2 \mid \mathrm{if}\ e_0\ \mathrm{then}\ e_1\ \mathrm{else}\ e_2 \\
  & \mid & \emptyset \mid \{e\} \mid e_1 \cup e_2 \mid \bigcup\{e_2 \mid x \in e_1\} \mid \mathrm{empty}(e) \\
  & \mid & i \mid e_1 + e_2 \mid e_1 \approx e_2 \mid \sum\{e_2 \mid x \in e_1\}
\end{array}
$$

Variables and let-expressions, pairing, boolean, and integer
operations are standard. Labels are used in the operational semantics
(Section 2.4). The expression $\emptyset$ denotes the empty collection; $\{e\}$
constructs a singleton collection, $e_1 \cup e_2$ takes the (multiset) union
of two collections, and $\bigcup\{e \mid x \in e_0\}$ iterates over the collection
obtained by evaluating $e_0$, applying $e(x)$ to each element of
the collection, and unioning the results. Note that we can define
$\{e \mid x \in e_0\}$ as $\bigcup\{\{e\} \mid x \in e_0\}$. We include integer constants,
addition ($e_1 + e_2$), and equality ($e_1 \approx e_2$). Finally, the empty(e)
predicate tests whether the collection denoted by e is empty, and the
$\sum\{e \mid x \in e_0\}$ operation takes the sum of a collection of integers.
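
As an informal illustration of the collection operations just described, the following Python sketch models multisets with `collections.Counter`. The function names (`singleton`, `big_union`, and so on) are ours and purely illustrative; elements are assumed hashable, so nested collections are not covered by this sketch.

```python
from collections import Counter

def singleton(v):
    """{e}: a multiset holding v exactly once."""
    return Counter({v: 1})

def union(m1, m2):
    """e1 U e2: multiset union adds multiplicities."""
    return m1 + m2

def big_union(body, coll):
    """U{e | x in e0}: apply body to each element of coll and union the
    results, counting each element as often as it occurs in coll."""
    out = Counter()
    for v, m in coll.items():
        for w, k in body(v).items():
            out[w] += m * k
    return out

def comprehension(body, coll):
    """{e | x in e0}, defined as U{{e} | x in e0}."""
    return big_union(lambda v: singleton(body(v)), coll)

def is_empty(coll):
    """empty(e): true when the multiset has no elements."""
    return sum(coll.values()) == 0

def big_sum(body, coll):
    """Sum{e | x in e0}: sum of body over coll, respecting multiplicity."""
    return sum(m * body(v) for v, m in coll.items())
```

For instance, `comprehension(lambda v: v % 2, Counter([1, 2, 3]))` maps two elements to 1 and one element to 0, so the multiplicities of the result record how many inputs produced each value.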

Expressions are identified modulo alpha-equivalence, regarding
x bound in e(x) in the expressions $\bigcup\{e(x) \mid x \in e_0\}$,
$\sum\{e(x) \mid x \in e_0\}$ and let $x = e_0$ in $e(x)$. We write $e[l/x]$ for the
result of substituting a label $l$ for a variable $x$ in $e$; labels cannot
be bound so substitution is naturally capture-avoiding.

2.1 Examples 

As with many core languages, it is inconvenient to program
directly in NRC. Instead, it is often more convenient to use
idiomatic "comprehension syntax" similar to Haskell's list
comprehensions (Wadler 1992; Buneman et al.). These can be
viewed as syntactic sugar for primitive NRC expressions, just as
in Haskell list comprehensions can be translated to the primitive
monadic operations on lists. Although we use unlabeled pairs, the
NRC can also be extended easily with convenient named-record 
syntax. These techniques are standard so here we only illustrate 
them via examples which will be used later in the paper. 

Example 1 Suppose we have relations R : {(A:int, B:int, C:int)},
S : {(C:int, D:int)}. Consider the SQL "join" query

SELECT R.A, R.B, S.D FROM R, S WHERE R.C = S.C

This is equivalent to the core NRC expression

$$
Q_1 = \bigcup\{\bigcup\{\mathrm{if}\ r.C = s.C\ \mathrm{then}\ \{(A{:}r.A, B{:}r.B, D{:}s.D)\}\ \mathrm{else}\ \emptyset \mid s \in S\} \mid r \in R\}
$$
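
To make the reading of $Q_1$ concrete, here is a small Python sketch that evaluates the same nested comprehension over multisets. The row types and the sample tables are hypothetical and are not the data of Figure 1.

```python
from collections import Counter, namedtuple

# Hypothetical row types for R(A, B, C), S(C, D) and the output (A, B, D).
RRow = namedtuple("RRow", "A B C")
SRow = namedtuple("SRow", "C D")
Out = namedtuple("Out", "A B D")

def q1(R, S):
    """Evaluate Q1: the outer union ranges over r in R, the inner over s in S."""
    result = Counter()
    for r, mr in R.items():
        for s, ms in S.items():
            if r.C == s.C:
                # then-branch: a singleton record, once per pair of copies
                result[Out(r.A, r.B, s.D)] += mr * ms
            # else-branch: the empty collection contributes nothing
    return result

# Illustrative input tables (assumed data).
R = Counter([RRow(1, 2, 3), RRow(4, 5, 3), RRow(7, 8, 9)])
S = Counter([SRow(3, 10), SRow(6, 20)])
joined = q1(R, S)
```

Only the two R rows with C = 3 find a partner in S, so `joined` holds the records (1, 2, 10) and (4, 5, 10), each with multiplicity 1.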

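Example 2 Given R, S as above, the SQL "aggregation" query

SELECT 42 AS C, SUM(D) FROM S WHERE C = 2
UNION
SELECT B AS C, A AS D FROM R WHERE C = 4

can be expressed as

$$
\begin{array}{rcl}
Q_2 & = & \{(C : 42, D : \sum\{\mathrm{if}\ s.C = 2\ \mathrm{then}\ s.D\ \mathrm{else}\ 0 \mid s \in S\})\} \\
    & \cup & \bigcup\{\mathrm{if}\ r.C = 4\ \mathrm{then}\ \{(C{:}r.B, D{:}r.A)\}\ \mathrm{else}\ \emptyset \mid r \in R\}
\end{array}
$$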
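
The same kind of sketch can be given for $Q_2$. Again the row types and tables are hypothetical; the point is only to show how the sum and the union in the NRC expression line up with the two SQL branches.

```python
from collections import Counter, namedtuple

# Hypothetical row types for R(A, B, C), S(C, D) and the output (C, D).
RRow = namedtuple("RRow", "A B C")
SRow = namedtuple("SRow", "C D")
Out = namedtuple("Out", "C D")

def q2(R, S):
    """Evaluate Q2: an aggregate row unioned with a filtered projection of R."""
    # Sum{if s.C = 2 then s.D else 0 | s in S}
    total = sum(m * (s.D if s.C == 2 else 0) for s, m in S.items())
    result = Counter({Out(42, total): 1})
    # U{if r.C = 4 then {(C: r.B, D: r.A)} else empty | r in R}
    for r, m in R.items():
        if r.C == 4:
            result[Out(r.B, r.A)] += m
    return result

# Illustrative input tables (assumed data).
R = Counter([RRow(1, 2, 4), RRow(5, 6, 7)])
S = Counter([SRow(2, 10), SRow(2, 5), SRow(3, 100)])
aggregated = q2(R, S)
```

With this data the aggregate row is (C: 42, D: 15) and the only R row with C = 4 contributes (C: 2, D: 1).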

Some sample input tables and the results of running $Q_1$ and $Q_2$
on them are shown in Figure 1. The labels r, r1, ... are used in
the operational semantics, as discussed in Section 2.4.

2.2 Type system 

NRC expressions can be typechecked using standard techniques. 
The typechecking rules are shown in Figure 2. We employ contexts
$\Gamma$ of the form $\Gamma ::= \cdot \mid \Gamma, x{:}\tau$.

2.3 Denotational semantics 

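The semantics of NRC expressions is usually defined
denotationally. We consider values $v \in \mathrm{Val}$ of the form:

$$v ::= i \mid b \mid (v_1, v_2) \mid \{v_1, \ldots, v_n\}$$

where $i \in \mathbb{Z}$ and $b \in \mathbb{B}$, and interpret types as sets of values, as
follows:

$$
\begin{array}{rcl}
[\![\mathrm{int}]\!] & = & \mathbb{Z} = \{\ldots, -1, 0, 1, 2, \ldots\} \\
[\![\mathrm{bool}]\!] & = & \mathbb{B} = \{\mathrm{t}, \mathrm{f}\} \\
[\![\tau_1 \times \tau_2]\!] & = & [\![\tau_1]\!] \times [\![\tau_2]\!] \\
[\![\{\tau\}]\!] & = & \mathcal{M}_{fin}([\![\tau]\!])
\end{array}
$$

[Figure 1. Examples: sample input tables and the output tables
$Q_1(A, B, D)$ and $Q_2(C, D)$.]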



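$$
\frac{x{:}\tau \in \Gamma}{\Gamma \vdash x : \tau}
\qquad
\frac{\Gamma \vdash e_1 : \tau_1 \quad \Gamma, x{:}\tau_1 \vdash e_2 : \tau_2}{\Gamma \vdash \mathrm{let}\ x = e_1\ \mathrm{in}\ e_2 : \tau_2}
$$

$$
\frac{i \in \mathbb{Z}}{\Gamma \vdash i : \mathrm{int}}
\qquad
\frac{\Gamma \vdash e_1 : \mathrm{int} \quad \Gamma \vdash e_2 : \mathrm{int}}{\Gamma \vdash e_1 + e_2 : \mathrm{int}}
\qquad
\frac{\Gamma \vdash e_1 : \mathrm{int} \quad \Gamma \vdash e_2 : \mathrm{int}}{\Gamma \vdash e_1 \approx e_2 : \mathrm{bool}}
$$

$$
\frac{b \in \mathbb{B}}{\Gamma \vdash b : \mathrm{bool}}
\qquad
\frac{\Gamma \vdash e : \mathrm{bool}}{\Gamma \vdash \neg e : \mathrm{bool}}
\qquad
\frac{\Gamma \vdash e_1 : \mathrm{bool} \quad \Gamma \vdash e_2 : \mathrm{bool}}{\Gamma \vdash e_1 \wedge e_2 : \mathrm{bool}}
$$

$$
\frac{\Gamma \vdash e : \mathrm{bool} \quad \Gamma \vdash e_1 : \tau \quad \Gamma \vdash e_2 : \tau}{\Gamma \vdash \mathrm{if}\ e\ \mathrm{then}\ e_1\ \mathrm{else}\ e_2 : \tau}
\qquad
\frac{\Gamma \vdash e : \{\tau\}}{\Gamma \vdash \mathrm{empty}(e) : \mathrm{bool}}
$$

$$
\frac{\Gamma \vdash e_1 : \tau_1 \quad \Gamma \vdash e_2 : \tau_2}{\Gamma \vdash (e_1, e_2) : \tau_1 \times \tau_2}
\qquad
\frac{\Gamma \vdash e : \tau_1 \times \tau_2}{\Gamma \vdash \pi_i(e) : \tau_i}
$$

$$
\frac{}{\Gamma \vdash \emptyset : \{\tau\}}
\qquad
\frac{\Gamma \vdash e : \tau}{\Gamma \vdash \{e\} : \{\tau\}}
\qquad
\frac{\Gamma \vdash e_1 : \{\tau\} \quad \Gamma \vdash e_2 : \{\tau\}}{\Gamma \vdash e_1 \cup e_2 : \{\tau\}}
$$

$$
\frac{\Gamma \vdash e_0 : \{\tau_0\} \quad \Gamma, x{:}\tau_0 \vdash e : \{\tau\}}{\Gamma \vdash \bigcup\{e \mid x \in e_0\} : \{\tau\}}
\qquad
\frac{\Gamma \vdash e_0 : \{\tau_0\} \quad \Gamma, x{:}\tau_0 \vdash e : \mathrm{int}}{\Gamma \vdash \sum\{e \mid x \in e_0\} : \mathrm{int}}
$$

Figure 2. Expression well-formedness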

We write $\mathcal{M}_{fin}(X)$ for the set of finite multisets of elements of X.
Figure 3 shows the (standard) equations defining the denotational
semantics of NRC expressions. NRC does not include arbitrary recursive
definitions, so we do not need to deal with nontermination.

We write $\gamma : \mathrm{Var} \to \mathrm{Val}$ for a finite function (or environment)
mapping variables x to values v. We write $[\![\Gamma]\!]$ for the set of all
environments $\gamma$ such that $\gamma(x) \in [\![\Gamma(x)]\!]$ for all $x \in \mathrm{dom}(\gamma)$.

The type system given above is sound in the following sense:

Proposition 1. If $\Gamma \vdash e : \tau$ then $[\![e]\!] : [\![\Gamma]\!] \to [\![\tau]\!]$.
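
The denotational reading can be sketched as a small interpreter. The following Python code is only an illustration under our own encoding: expressions are tagged tuples (the tag names are ours), environments are dictionaries, and multisets are represented as Python lists whose order is ignored, so that nested collections remain possible.

```python
def denote(e, env):
    """Return the value of expression e in environment env (a dict)."""
    tag = e[0]
    if tag == "var":
        return env[e[1]]
    if tag == "const":          # integer or boolean constant
        return e[1]
    if tag == "let":
        _, x, e1, e2 = e
        return denote(e2, {**env, x: denote(e1, env)})
    if tag == "add":
        return denote(e[1], env) + denote(e[2], env)
    if tag == "eq":
        return denote(e[1], env) == denote(e[2], env)
    if tag == "if":
        return denote(e[2], env) if denote(e[1], env) else denote(e[3], env)
    if tag == "pair":
        return (denote(e[1], env), denote(e[2], env))
    if tag == "proj":           # ("proj", i, e) with i in {1, 2}
        return denote(e[2], env)[e[1] - 1]
    if tag == "empty_coll":
        return []
    if tag == "sng":
        return [denote(e[1], env)]
    if tag == "union":
        return denote(e[1], env) + denote(e[2], env)
    if tag == "bigunion":       # ("bigunion", x, body, e0)
        _, x, body, e0 = e
        return [w for v in denote(e0, env)
                for w in denote(body, {**env, x: v})]
    if tag == "sum":            # ("sum", x, body, e0)
        _, x, body, e0 = e
        return sum(denote(body, {**env, x: v}) for v in denote(e0, env))
    if tag == "isempty":
        return denote(e[1], env) == []
    raise ValueError(f"unknown expression tag: {tag}")

# if x = 5 then y + 42 else x
example = ("if", ("eq", ("var", "x"), ("const", 5)),
           ("add", ("var", "y"), ("const", 42)),
           ("var", "x"))
# {pi_2(x) | x in R}, written as U{{pi_2(x)} | x in R}
select_b = ("bigunion", "x", ("sng", ("proj", 2, ("var", "x"))), ("var", "R"))
```

Because NRC has no recursion, every call to `denote` terminates on a well-formed expression, which mirrors the remark above about nontermination.
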
2.4 Operational semantics 

The semantics of NRC is usually presented denotationally. For the 
purposes of this paper, we will introduce an operational semantics 
based on stores in which every part of every value has a label. 
This semantics will serve as the basis for our trace semantics, since 
labels can easily be used to address parts of the input, output, and 
intermediate values of a query. Thus, labels play a dual role as 
addresses of values in the store and as "locations" mentioned in 
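traces. Note that NRC is a purely functional language and so labels
are written at most once.

$$
\begin{array}{rcl}
[\![x]\!]\gamma & = & \gamma(x) \\
[\![\mathrm{let}\ x = e_1\ \mathrm{in}\ e_2]\!]\gamma & = & [\![e_2]\!]\gamma[x \mapsto [\![e_1]\!]\gamma] \\
[\![i]\!]\gamma & = & i \\
[\![e_1 + e_2]\!]\gamma & = & [\![e_1]\!]\gamma + [\![e_2]\!]\gamma \\
[\![\sum\{e \mid x \in e_0\}]\!]\gamma & = & \sum\{[\![e]\!]\gamma[x \mapsto v] \mid v \in [\![e_0]\!]\gamma\} \\
[\![b]\!]\gamma & = & b \\
[\![\neg e]\!]\gamma & = & \neg[\![e]\!]\gamma \\
[\![e_1 \wedge e_2]\!]\gamma & = & [\![e_1]\!]\gamma \wedge [\![e_2]\!]\gamma \\
[\![(e_1, e_2)]\!]\gamma & = & ([\![e_1]\!]\gamma, [\![e_2]\!]\gamma) \\
[\![\pi_i(e)]\!]\gamma & = & \pi_i([\![e]\!]\gamma) \\
[\![\emptyset]\!]\gamma & = & \emptyset \\
[\![\{e\}]\!]\gamma & = & \{[\![e]\!]\gamma\} \\
[\![e_1 \cup e_2]\!]\gamma & = & [\![e_1]\!]\gamma \cup [\![e_2]\!]\gamma \\
[\![\bigcup\{e \mid x \in e_0\}]\!]\gamma & = & \bigcup\{[\![e]\!]\gamma[x \mapsto v] \mid v \in [\![e_0]\!]\gamma\} \\
[\![\mathrm{if}\ e_0\ \mathrm{then}\ e_1\ \mathrm{else}\ e_2]\!]\gamma & = & \begin{cases} [\![e_1]\!]\gamma & \text{if } [\![e_0]\!]\gamma = \mathrm{t} \\ [\![e_2]\!]\gamma & \text{if } [\![e_0]\!]\gamma = \mathrm{f} \end{cases} \\
[\![e_1 \approx e_2]\!]\gamma & = & \begin{cases} \mathrm{t} & \text{if } [\![e_1]\!]\gamma = [\![e_2]\!]\gamma \\ \mathrm{f} & \text{if } [\![e_1]\!]\gamma \neq [\![e_2]\!]\gamma \end{cases} \\
[\![\mathrm{empty}(e)]\!]\gamma & = & \begin{cases} \mathrm{t} & \text{if } [\![e]\!]\gamma = \emptyset \\ \mathrm{f} & \text{if } [\![e]\!]\gamma \neq \emptyset \end{cases}
\end{array}
$$

Figure 3. Denotational semantics of NRC

$$
\begin{array}{rcl}
\mathrm{op}(l, \sigma) & = & \sigma(l) \\
\mathrm{op}(i, \sigma) & = & i \\
\mathrm{op}(l_1 + l_2, \sigma) & = & \sigma(l_1) +_{\mathbb{Z}} \sigma(l_2) \\
\mathrm{op}(l_1 \approx l_2, \sigma) & = & \begin{cases} \mathrm{t} & (\sigma(l_1) = \sigma(l_2)) \\ \mathrm{f} & (\sigma(l_1) \neq \sigma(l_2)) \end{cases} \\
\mathrm{op}(b, \sigma) & = & b \\
\mathrm{op}(l_1 \wedge l_2, \sigma) & = & \sigma(l_1) \wedge_{\mathbb{B}} \sigma(l_2) \\
\mathrm{op}(\neg l, \sigma) & = & \neg_{\mathbb{B}}\, \sigma(l) \\
\mathrm{op}((l_1, l_2), \sigma) & = & (l_1, l_2) \\
\mathrm{op}(\emptyset, \sigma) & = & \emptyset \\
\mathrm{op}(\{l\}, \sigma) & = & \{l : 1\} \\
\mathrm{op}(l_1 \cup l_2, \sigma) & = & \sigma(l_1) \cup \sigma(l_2) \\
\mathrm{op}(\mathrm{empty}(l), \sigma) & = & \begin{cases} \mathrm{t} & (\sigma(l) = \emptyset) \\ \mathrm{f} & (\sigma(l) \neq \emptyset) \end{cases}
\end{array}
$$

Figure 4. Definition of op

$$
\frac{}{\sigma, l \Leftarrow t \Downarrow \sigma[l := \mathrm{op}(t, \sigma)]}
\qquad
\frac{\sigma, l' \Leftarrow e_1 \Downarrow \sigma' \quad \sigma', l \Leftarrow e_2[l'/x] \Downarrow \sigma'' \quad l'\ \text{fresh}}{\sigma, l \Leftarrow \mathrm{let}\ x = e_1\ \mathrm{in}\ e_2 \Downarrow \sigma''}
$$

$$
\frac{\sigma(l') = b \quad \sigma, l \Leftarrow e_b \Downarrow \sigma'}{\sigma, l \Leftarrow \mathrm{if}\ l'\ \mathrm{then}\ e_{\mathrm{t}}\ \mathrm{else}\ e_{\mathrm{f}} \Downarrow \sigma'}
\qquad
\frac{\sigma(l') = (l_1, l_2)}{\sigma, l \Leftarrow \pi_i(l') \Downarrow \sigma[l := \sigma(l_i)]}
$$

$$
\frac{\sigma, x \in \sigma(l_0), e \Downarrow^* \sigma', L'}{\sigma, l \Leftarrow \bigcup\{e \mid x \in l_0\} \Downarrow \sigma'[l := \bigcup \sigma'[L']]}
$$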

In order to ensure that each part of each value has a label, we 
employ a store mapping labels to value constructors, which can be 
thought of as individual heap cells each describing one part of a 
value. We define value constructors $k \in \mathrm{Con}$ as follows:

$$k ::= i \mid b \mid (l_1, l_2) \mid \{l_1 : m_1, \ldots, l_n : m_n\}$$

Here, $\{l_1 : m_1, \ldots, l_n : m_n\}$ denotes a multiset of labels (often
denoted L, L'), where $m_i$ is the multiplicity of $l_i$. Multiplicities
are assumed nonzero and omitted when equal to 1. Multisets are
equivalent up to reordering and we assume the elements $l_i$ are
distinct. We write $M \cup N$ for multiset union and $M \oplus N$ for
domain-disjoint multiset union, defined only when
$\mathrm{dom}(M) \cap \mathrm{dom}(N) = \emptyset$.

We write Lab(k) for the set of labels mentioned in k. Stores
are finite maps $\sigma : \mathrm{Lab} \to \mathrm{Con}$ from labels to constructors. We
also consider label environments to be finite maps from variables
to labels $\gamma : \mathrm{Var} \to \mathrm{Lab}$.

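We will restrict attention to NRC expressions in "A-normal
form", defined as follows:

$$
\begin{array}{rcl}
w & ::= & x \mid l \\
e & ::= & w \mid \mathrm{let}\ x = e_1\ \mathrm{in}\ e_2 \mid (w_1, w_2) \mid \pi_i(w) \\
  & \mid & b \mid \neg w \mid w_1 \wedge w_2 \mid \mathrm{if}\ w_0\ \mathrm{then}\ e_1\ \mathrm{else}\ e_2 \\
  & \mid & i \mid w_1 + w_2 \mid \sum\{e_2 \mid x \in w_1\} \mid w_1 \approx w_2 \\
  & \mid & \emptyset \mid \{w\} \mid w_1 \cup w_2 \mid \bigcup\{e_2 \mid x \in w_1\} \mid \mathrm{empty}(w)
\end{array}
$$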

The A-normalization translation is standard and straightforward,
so omitted. The operational semantics rules are shown in Figure 5.
The rules are in destination-passing style. We use two judgments:
$\sigma, l \Leftarrow e \Downarrow \sigma'$, meaning "in store $\sigma$, evaluating e at location l
yields store $\sigma'$"; and $\sigma, x \in L, e \Downarrow^* \sigma', L'$, meaning "in store $\sigma$,
iterating e with x bound to each element of L yields store $\sigma'$ and
result labels L'." The second judgment deals with iteration over
multisets involved in comprehensions; this exemplifies a common
pattern used throughout the paper.



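$$
\frac{\sigma, x \in \sigma(l_0), e \Downarrow^* \sigma', L'}{\sigma, l \Leftarrow \sum\{e \mid x \in l_0\} \Downarrow \sigma'[l := \sum \sigma'[L']]}
$$

$$
\frac{}{\sigma, x \in \emptyset, e \Downarrow^* \sigma, \emptyset}
\qquad
\frac{\sigma, l' \Leftarrow e[l/x] \Downarrow \sigma' \quad l'\ \text{fresh}}{\sigma, x \in \{l : m\}, e \Downarrow^* \sigma', \{l' : m\}}
$$

$$
\frac{\sigma, x \in L_1, e \Downarrow^* \sigma_1, L'_1 \quad \sigma, x \in L_2, e \Downarrow^* \sigma_2, L'_2}{\sigma, x \in L_1 \oplus L_2, e \Downarrow^* \sigma_1 \uplus_\sigma \sigma_2, L'_1 \oplus L'_2}
$$

Figure 5. Operational semantics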



Many of the rules are similar; for brevity, we use a single rule
for terms t of the following forms:

$$
\begin{array}{rcl}
t & ::= & i \mid l_1 + l_2 \mid l_1 \approx l_2 \mid b \mid \neg l \mid l_1 \wedge l_2 \\
  & \mid & (l_1, l_2) \mid l \mid \emptyset \mid \{l\} \mid l_1 \cup l_2 \mid \mathrm{empty}(l)
\end{array}
$$

Each term is either a constant, a label, or a constructor or primitive
function applied to some labels. The meaning of each of these
operations is defined via the op function, as shown in Figure 4,
which maps a term $t \in \mathrm{Term}$ and a store $\sigma : \mathrm{Lab} \to \mathrm{Con}$ to a
constructor.
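
A rough Python rendering of op may help fix intuitions. In this sketch (our encoding, not the paper's) a store is a dictionary from label strings to constructors, where a constructor is an int, a bool, a pair of labels, or a `Counter` of labels; terms are tagged tuples whose tag names are assumptions of the sketch.

```python
from collections import Counter

def op(t, store):
    """Map a term t and a store to the constructor written at the target."""
    tag = t[0]
    if tag == "label":          # copy the constructor found at a label
        return store[t[1]]
    if tag in ("int", "bool"):  # constants denote themselves
        return t[1]
    if tag == "add":
        return store[t[1]] + store[t[2]]
    if tag == "eq":
        return store[t[1]] == store[t[2]]
    if tag == "not":
        return not store[t[1]]
    if tag == "and":
        return store[t[1]] and store[t[2]]
    if tag == "pair":           # a pair cell holds the two labels themselves
        return (t[1], t[2])
    if tag == "empty_coll":
        return Counter()
    if tag == "sng":            # {l} is the multiset {l : 1}
        return Counter({t[1]: 1})
    if tag == "union":
        return store[t[1]] + store[t[2]]
    if tag == "isempty":
        return len(store[t[1]]) == 0
    raise ValueError(f"unknown term tag: {tag}")
```

Note that pair and singleton terms store labels rather than values: the value at those labels is found by following the store, which is what lets every part of a value keep its own address.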

When L is a multiset of labels, we write $\sigma[L]$ for the multiset of
constructors $\{\sigma(l) : m \mid l : m \in L\}$. This notation is used in
the rules for $\bigcup$ and $\sum$. In this notation, the standard definition
of summation of multisets of integers is
$\sum\{i_1 : m_1, \ldots, i_n : m_n\} = \sum_{j=1}^{n} i_j \cdot m_j$. Similarly,
$\bigcup\{L_1 : m_1, \ldots, L_n : m_n\} = m_1 \cdot L_1 \cup \cdots \cup m_n \cdot L_n$,
where $m \cdot \{l_1 : k_1, \ldots, l_n : k_n\} = \{l_1 : m \cdot k_1, \ldots, l_n : m \cdot k_n\}$.
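
The multiplicity-aware sum and union above can be checked on small inputs. The following Python sketch represents $\sigma[L]$ as a list of (constructor, multiplicity) pairs, since constructors that are themselves multisets are not hashable; the helper names are ours.

```python
from collections import Counter

def lookup_all(store, L):
    """sigma[L]: the constructors at the labels of L, with multiplicities."""
    return [(store[l], m) for l, m in L.items()]

def msum(pairs):
    """Sum of a multiset of integers: each i_j counts m_j times."""
    return sum(i * m for i, m in pairs)

def scale(m, L):
    """m . L: multiply every multiplicity in L by m."""
    return Counter({l: m * k for l, k in L.items()})

def munion(pairs):
    """Union of a multiset of label multisets: m_1.L_1 U ... U m_n.L_n."""
    out = Counter()
    for L, m in pairs:
        out += scale(m, L)
    return out

# Integers 3 (twice) and 4 (once) sum to 10.
ints = {"a": 3, "b": 4}
total = msum(lookup_all(ints, Counter({"a": 2, "b": 1})))

# Two copies of {x} together with one copy of {x, y:2}.
colls = {"p": Counter({"x": 1}), "q": Counter({"x": 1, "y": 2})}
merged = munion(lookup_all(colls, Counter({"p": 2, "q": 1})))
```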

The iteration rules for $\sigma, x \in L, e \Downarrow^* \sigma', L'$ evaluate e with x
bound to each $l \in L$ independently, preserving the multiplicity
of labels. They split L using $\oplus$ and combine the result stores using
the orthogonal store merging operation $\uplus_\sigma$ defined as follows:

Definition 1 (Orthogonal extensions and merging) We say $\sigma_1$
and $\sigma_2$ are orthogonal extensions of $\sigma$ if $\sigma_1 = \sigma \uplus \sigma'_1$ and
$\sigma_2 = \sigma \uplus \sigma'_2$ and $\mathrm{dom}(\sigma'_1) \cap \mathrm{dom}(\sigma'_2) = \emptyset$, and we write
$\sigma_1 \uplus_\sigma \sigma_2$ for $\sigma \uplus \sigma'_1 \uplus \sigma'_2$.
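
Definition 1 can be paraphrased as a small function on dictionary stores. This is a sketch under our own representation (stores as Python dicts); the function name `merge_orthogonal` is hypothetical.

```python
def merge_orthogonal(base, s1, s2):
    """Merge two stores that both extend `base` with disjoint new labels."""
    # Both stores must agree with the base store on every base label.
    for l, k in base.items():
        if s1.get(l) != k or s2.get(l) != k:
            raise ValueError("both stores must extend the base store")
    ext1 = {l: k for l, k in s1.items() if l not in base}
    ext2 = {l: k for l, k in s2.items() if l not in base}
    # The extensions must not write to a common label.
    if ext1.keys() & ext2.keys():
        raise ValueError("extensions are not orthogonal")
    return {**base, **ext1, **ext2}
```

In the iteration rules the disjointness condition holds because each iteration writes only to labels that are fresh for it, so the merged store simply collects the cells written by the independent iterations.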

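The operational semantics is illustrated on Examples 1 and 2
in Figure 1; here, the labels r, r1, ..., s, ... uniquely identify each
part of the input tables R, S and the labels on the results reflect one
possible labeling that is consistent with examples given later.

$$
\frac{}{\Omega \vdash_{term} i : \mathrm{int}}
\qquad
\frac{\Omega(w_1) = \Omega(w_2) = \mathrm{int}}{\Omega \vdash_{term} w_1 + w_2 : \mathrm{int}}
\qquad
\frac{\Omega(w_1) = \Omega(w_2) = \mathrm{int}}{\Omega \vdash_{term} w_1 \approx w_2 : \mathrm{bool}}
$$

$$
\frac{}{\Omega \vdash_{term} (w_1, w_2) : \Omega(w_1) \times \Omega(w_2)}
\qquad
\frac{}{\Omega \vdash_{term} b : \mathrm{bool}}
$$

$$
\frac{\Omega(w_1) = \Omega(w_2) = \mathrm{bool}}{\Omega \vdash_{term} w_1 \wedge w_2 : \mathrm{bool}}
\qquad
\frac{\Omega(w) = \mathrm{bool}}{\Omega \vdash_{term} \neg w : \mathrm{bool}}
$$

$$
\frac{}{\Omega \vdash_{term} \emptyset : \{\tau\}}
\qquad
\frac{\Omega(w) = \tau}{\Omega \vdash_{term} \{w\} : \{\tau\}}
\qquad
\frac{\Omega(w_1) = \{\tau\} = \Omega(w_2)}{\Omega \vdash_{term} w_1 \cup w_2 : \{\tau\}}
$$

$$
\frac{\Omega(w) = \{\tau\}}{\Omega \vdash_{term} \mathrm{empty}(w) : \mathrm{bool}}
\qquad
\frac{}{\Omega \vdash_{term} w : \Omega(w)}
$$

$$
\frac{\Omega \vdash_{term} t : \tau}{\Omega \vdash t : \tau}
\qquad
\frac{\Omega \vdash e_1 : \tau' \quad \Omega, x{:}\tau' \vdash e_2 : \tau}{\Omega \vdash \mathrm{let}\ x = e_1\ \mathrm{in}\ e_2 : \tau}
$$

$$
\frac{\Omega(w) = \tau_1 \times \tau_2}{\Omega \vdash \pi_i(w) : \tau_i}
\qquad
\frac{\Omega(w) = \mathrm{bool} \quad \Omega \vdash e_{\mathrm{t}} : \tau \quad \Omega \vdash e_{\mathrm{f}} : \tau}{\Omega \vdash \mathrm{if}\ w\ \mathrm{then}\ e_{\mathrm{t}}\ \mathrm{else}\ e_{\mathrm{f}} : \tau}
$$

$$
\frac{\Omega(w) = \{\tau\} \quad \Omega, x{:}\tau \vdash e : \{\tau'\}}{\Omega \vdash \bigcup\{e \mid x \in w\} : \{\tau'\}}
\qquad
\frac{\Omega(w) = \{\tau\} \quad \Omega, x{:}\tau \vdash e : \mathrm{int}}{\Omega \vdash \sum\{e \mid x \in w\} : \mathrm{int}}
$$

Figure 6. Well-formed A-normalized NRC expressions

[Rules for the judgments $\Psi \vdash_{con} k : \tau$ and $\sigma : \Psi$; the surviving
conclusions are $\Psi \vdash_{con} \{l_1 : m_1, \ldots, l_n : m_n\} : \{\tau\}$ and
$\sigma[l := k] : \Psi, l{:}\tau$.]

Figure 7. Store and constructor well-formedness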

2.5 Type system for A-normalized expressions 

We define typing rules for (normalized) NRC expressions as shown
in Figure 6. We use standard contexts $\Gamma ::= \cdot \mid \Gamma, x{:}\tau$ mapping
variables to types and store types of the form $\Psi ::= \cdot \mid \Psi, l{:}\tau$.
For brevity, we write $\Omega$ for a pair $\Psi, \Gamma$ and $\Omega(w)$ for $\Psi(l)$ if $w = l$
or $\Gamma(x)$ if $w = x$ respectively. The judgment $\Psi, \Gamma \vdash e : \tau$ means
that given store type $\Psi$ and context $\Gamma$, expression e has type $\tau$.

The well-formedness judgment for stores is $\sigma : \Psi$, or "$\sigma$ has
store type $\Psi$". This judgment is defined in Figure 7 using an
auxiliary judgment $\Psi \vdash_{con} k : \tau$, meaning "in stores of type $\Psi$,
constructor k has type $\tau$". Note that well-formed stores must be
acyclic according to this judgment since the last rule permits each
label to be traversed at most once. The well-formedness judgment
for environments $\gamma : \mathrm{Var} \to \mathrm{Lab}$ is $\Psi \vdash \gamma : \Gamma$, or "in a store
with type $\Psi$, environment $\gamma$ matches context $\Gamma$". The rules are as
follows:

$$
\frac{}{\Psi \vdash \cdot : \cdot}
\qquad
\frac{\Psi \vdash \gamma : \Gamma \quad \Psi(l) = \tau}{\Psi \vdash \gamma, x \mapsto l : \Gamma, x{:}\tau}
$$

We sometimes combine the judgments and write $\Psi \vdash \sigma, \gamma : \Gamma$ to
indicate $\sigma : \Psi$ and $\Psi \vdash \gamma : \Gamma$. The operational semantics is sound
with respect to the store typing rules:

Theorem 1. Suppose $\Psi \vdash e : \tau$ and $\sigma : \Psi$. Then if $\sigma, l \Leftarrow e \Downarrow \sigma'$
then there exists $\Psi'$ such that $\Psi'(l) = \tau$ and $\sigma' : \Psi'$.

2.6 Correctness of operational semantics 

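To show the correctness of the operational semantics relative to the
denotational semantics, we need to translate from stores and labels
to values. We define the functions $\sigma \uparrow_\tau l$ by induction on types as
follows:

$$
\begin{array}{rcl}
\sigma \uparrow_{\mathrm{int}} l & = & \sigma(l) \\
\sigma \uparrow_{\mathrm{bool}} l & = & \sigma(l) \\
\sigma \uparrow_{\tau_1 \times \tau_2} l & = & (\sigma \uparrow_{\tau_1} \pi_1(\sigma(l)), \sigma \uparrow_{\tau_2} \pi_2(\sigma(l))) \\
\sigma \uparrow_{\{\tau\}} l & = & \{\sigma \uparrow_\tau l' \mid l' \in \sigma(l)\}
\end{array}
$$

We also define $\sigma \uparrow_\Gamma \gamma$ pointwise, so that
$(\sigma \uparrow_\Gamma \gamma)(x) = \sigma \uparrow_{\Gamma(x)} \gamma(x)$. We can easily show that:

Proposition 2. If $\sigma : \Psi$ and $l : \tau \in \Psi$ then $\sigma \uparrow_\tau l \in [\![\tau]\!]$.
Moreover, if $\Psi \vdash \gamma : \Gamma$ then $\sigma \uparrow_\Gamma \gamma \in [\![\Gamma]\!]$.

The correctness of the operational semantics can then be
established by induction on the structure of derivations:

Proposition 3. Suppose that $\Gamma \vdash e : \tau$ and $\Psi \vdash \sigma, \gamma : \Gamma$. Then
there exists $\sigma'$ such that $\sigma, l \Leftarrow \gamma(e) \Downarrow \sigma'$. Moreover, for any such
$\sigma'$, $[\![e]\!](\sigma \uparrow_\Gamma \gamma) = \sigma' \uparrow_\tau l$.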
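
The translation from stores and labels back to values is easy to prototype. In the Python sketch below, types are encoded as `"int"`, `"bool"`, `("pair", t1, t2)` and `("coll", t)`, stores are dictionaries, collection cells are `Counter`s of labels, and a multiset value is returned as a list with each element repeated according to its multiplicity. All of these encodings are assumptions of the sketch.

```python
from collections import Counter

def readback(store, ty, l):
    """Reconstruct the value of type ty stored at label l."""
    if ty in ("int", "bool"):
        return store[l]
    if ty[0] == "pair":
        l1, l2 = store[l]          # a pair cell holds two labels
        return (readback(store, ty[1], l1), readback(store, ty[2], l2))
    if ty[0] == "coll":
        # Follow every element label, once per unit of multiplicity.
        return [readback(store, ty[1], elem)
                for elem, m in store[l].items() for _ in range(m)]
    raise ValueError(f"unknown type: {ty!r}")

# A labeled two-row table with integer fields, in the style of the
# SELECT B example from the introduction.
store = {
    "l11": 1, "l12": 2, "l1": ("l11", "l12"),
    "l21": 2, "l22": 3, "l2": ("l21", "l22"),
    "l": Counter({"l1": 1, "l2": 1}),
}
table_type = ("coll", ("pair", "int", "int"))
rows = readback(store, table_type, "l")
```

Sorting `rows` gives `[(1, 2), (2, 3)]`; the labels disappear and only the ordinary nested value remains, which is what the correctness statement compares against the denotational semantics.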

3. Traced evaluation 

We now consider traces which are intended to capture the
"execution history" of a query in a form that is itself suitable for
querying. We define traces T using the terms introduced earlier as
follows:

$$
\begin{array}{rcl}
T & ::= & l \leftarrow t \mid l \leftarrow \mathrm{proj}_i(l', l'') \mid \mathrm{cond}_l(l', b, T)^{e_1}_{e_2} \mid T_1; T_2 \\
  & \mid & l \leftarrow \mathrm{sum}(l', \Theta)_{x.e} \mid l \leftarrow \mathrm{comp}(l', \Theta)_{x.e} \\
\Theta & ::= & \{[l_1]T_1 : m_1, \ldots, [l_n]T_n : m_n\}
\end{array}
$$

Terms, introduced above, describe single computation steps.
Labeled trace collections $\Theta$ are multisets of labeled traces $[l]T$.
Assignment traces $l \leftarrow t$ record that a new label $l$ was created and
assigned the value described by trace term t. Projection traces
$l \leftarrow \mathrm{proj}_i(l', l'')$ record that $l$ was created and assigned the value at
$l''$, by projecting the i-th component of pair $l'$. Sequential
composition traces $T_1; T_2$ indicate that $T_1$ was performed first followed by
$T_2$. Conditional traces $\mathrm{cond}_l(l', b, T)^{e_1}_{e_2}$ record that a conditional
expression tested $l'$, found it equal to boolean b, and then performed
trace T that writes to $l$. In addition, conditional traces record the
alternative expressions $e_1$ and $e_2$ corresponding to the true and false
branches. Comprehension traces $l \leftarrow \mathrm{comp}(l', \Theta)_{x.e}$ record that $l$
was created by performing a comprehension over the set at $l'$, with
subtraces $\Theta$ describing the iterations; the expression x.e records the
body of the comprehension with its bound variable x. Sum traces
$l \leftarrow \mathrm{sum}(l', \Theta)_{x.e}$ are similar.

When the expressions $e_1$, $e_2$, x.e in conditional or comprehension
traces are irrelevant to the discussion we often omit them for
brevity, e.g. writing $\mathrm{cond}_l(l', b, T)$ or $\mathrm{comp}(l', \Theta)$.

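We define the result label of a trace as follows:

$$
\begin{array}{rcl}
\mathrm{out}(l \leftarrow t) & = & l \\
\mathrm{out}(T_1; T_2) & = & \mathrm{out}(T_2) \\
\mathrm{out}(\mathrm{cond}_l(l', b, T)^{e_1}_{e_2}) & = & l \\
\mathrm{out}(l \leftarrow \mathrm{proj}_i(l', l'')) & = & l \\
\mathrm{out}(l \leftarrow \mathrm{comp}(l', \Theta)_{x.e}) & = & l \\
\mathrm{out}(l \leftarrow \mathrm{sum}(l', \Theta)_{x.e}) & = & l
\end{array}
$$

We define the input labels of a labeled trace set $\Theta$ as
$\mathrm{in}^*(\Theta) = \{l : m \mid [l]T : m \in \Theta\}$. Similarly, the result labels of
$\Theta$ are defined as $\mathrm{out}^*(\Theta) = \{\mathrm{out}(T) : m \mid [l]T : m \in \Theta\}$. Note
that we treat both as multisets.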
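
The trace grammar and the functions out, in* and out* translate directly into a small data model. The Python sketch below is one possible encoding, not the paper's: the class and field names are ours, expression annotations are omitted, and a labeled trace collection is a tuple of (input label, trace, multiplicity) triples.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Assign:            # l <- t
    label: str
    term: tuple

@dataclass(frozen=True)
class Proj:              # l <- proj_i(l', l'')
    label: str
    index: object
    source: str
    field_label: str

@dataclass(frozen=True)
class Cond:              # cond_l(l', b, T), annotations omitted
    label: str
    test: str
    value: bool
    body: object

@dataclass(frozen=True)
class Seq:               # T1; T2
    first: object
    second: object

@dataclass(frozen=True)
class Iter:              # l <- comp(l', Theta) or l <- sum(l', Theta)
    label: str
    source: str
    subtraces: tuple     # triples (input_label, trace, multiplicity)
    kind: str = "comp"

def out(t):
    """Result label of a trace."""
    return out(t.second) if isinstance(t, Seq) else t.label

def in_labels(theta):
    """in*(Theta): the multiset of input labels."""
    c = Counter()
    for l, _trace, m in theta:
        c[l] += m
    return c

def out_labels(theta):
    """out*(Theta): the multiset of result labels of the subtraces."""
    c = Counter()
    for _l, trace, m in theta:
        c[out(trace)] += m
    return c

# The trace of the SELECT B example from the introduction.
theta = (("l1", Proj("l1'", "B", "l1", "l12"), 1),
         ("l2", Proj("l2'", "B", "l2", "l22"), 1))
select_b_trace = Iter("l'", "l", theta)
```

A sequential composition reports the result label of its second component, so `out` never needs to inspect the first component of a `Seq`.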

3.1 Traced operational semantics 

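We now define traced evaluation, a refinement of the operational
semantics in Section 2.4. The rules for traced evaluation are shown
in Figure 8. There are two judgments: $\sigma, l \Leftarrow e \Downarrow \sigma', T$, meaning
"Starting in store $\sigma$, evaluating e and storing the result at $l$ yields
store $\sigma'$ and trace T", and $\sigma, x \in L, e \Downarrow^* \sigma', L', \Theta$, meaning
"Starting in store $\sigma$, evaluating e with x bound to each label in L in turn
yields store $\sigma'$, result labels L' and labeled traces $\Theta$".

$$
\frac{}{\sigma, l \Leftarrow t \Downarrow \sigma[l := \mathrm{op}(t, \sigma)],\ l \leftarrow t}
$$

$$
\frac{\sigma, l' \Leftarrow e_1 \Downarrow \sigma_1, T_1 \quad \sigma_1, l \Leftarrow e_2[l'/x] \Downarrow \sigma_2, T_2 \quad l'\ \text{fresh}}{\sigma, l \Leftarrow \mathrm{let}\ x = e_1\ \mathrm{in}\ e_2 \Downarrow \sigma_2,\ T_1; T_2}
$$

$$
\frac{\sigma(l') = b \quad \sigma, l \Leftarrow e_b \Downarrow \sigma', T}{\sigma, l \Leftarrow \mathrm{if}\ l'\ \mathrm{then}\ e_{\mathrm{t}}\ \mathrm{else}\ e_{\mathrm{f}} \Downarrow \sigma',\ \mathrm{cond}_l(l', b, T)^{e_{\mathrm{t}}}_{e_{\mathrm{f}}}}
$$

$$
\frac{\sigma(l') = (l_1, l_2)}{\sigma, l \Leftarrow \pi_i(l') \Downarrow \sigma[l := \sigma(l_i)],\ l \leftarrow \mathrm{proj}_i(l', l_i)}
$$

$$
\frac{\sigma, x \in \sigma(l'), e \Downarrow^* \sigma', L', \Theta}{\sigma, l \Leftarrow \bigcup\{e \mid x \in l'\} \Downarrow \sigma'[l := \bigcup \sigma'[L']],\ l \leftarrow \mathrm{comp}(l', \Theta)_{x.e}}
$$

$$
\frac{\sigma, x \in \sigma(l'), e \Downarrow^* \sigma', L', \Theta}{\sigma, l \Leftarrow \sum\{e \mid x \in l'\} \Downarrow \sigma'[l := \sum \sigma'[L']],\ l \leftarrow \mathrm{sum}(l', \Theta)_{x.e}}
$$

$$
\frac{}{\sigma, x \in \emptyset, e \Downarrow^* \sigma, \emptyset, \emptyset}
\qquad
\frac{\sigma, l' \Leftarrow e[l/x] \Downarrow \sigma', T \quad l'\ \text{fresh}}{\sigma, x \in \{l : m\}, e \Downarrow^* \sigma', \{l' : m\}, \{[l]T : m\}}
$$

$$
\frac{\sigma, x \in L_1, e \Downarrow^* \sigma_1, L'_1, \Theta_1 \quad \sigma, x \in L_2, e \Downarrow^* \sigma_2, L'_2, \Theta_2}{\sigma, x \in L_1 \oplus L_2, e \Downarrow^* \sigma_1 \uplus_\sigma \sigma_2, L'_1 \oplus L'_2, \Theta_1 \oplus \Theta_2}
$$

Figure 8. Traced evaluation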

Each operational semantics rule relates a different expression
form to its trace form. Thus, traces can be viewed as explaining the
dynamic execution history of the expression. (We will make this
precise in Section 5.) In particular, terms t are translated to
assignment traces. Let-expressions are translated to sequential
compositions of traces. For these expressions, it would be superfluous
to record additional information such as the values of the inputs
and outputs, since this can be recovered from the input store and
the trace (as we shall see below). However, more detailed trace
information is needed for some expressions, such as projections,
conditionals, comprehensions, and sums. Their traces record some
expression annotations and some information about the structure of
the input store. Conditionals record the boolean value of the
conditional test as well as both branches of the conditional;
comprehensions and sums record the labels and subtraces of the elements of
the input set as well as the body of the comprehension. This
information is necessary to obtain the fidelity property (Section 5.2) and
to ensure that we can extract other forms of provenance from traces
(Section 4).
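
To see how the rules for terms, let and conditionals produce traces, here is a minimal Python sketch of traced evaluation for that fragment only. It is an illustration under several assumptions of ours: expressions are tagged tuples in A-normal form, substitution $e[l/x]$ is replaced by a label environment, fresh labels come from an iterator, traces are plain tuples, and the expression annotations on conditional traces are omitted.

```python
import itertools

def op(t, store):
    """A cut-down op for the terms used in this sketch."""
    tag = t[0]
    if tag == "int":
        return t[1]
    if tag == "label":
        return store[t[1]]
    if tag == "add":
        return store[t[1]] + store[t[2]]
    if tag == "eq":
        return store[t[1]] == store[t[2]]
    raise ValueError(f"unknown term tag: {tag}")

def traced_eval(store, dest, e, env, fresh):
    """Evaluate e at label dest; return the new store and the trace."""
    tag = e[0]
    if tag == "let":
        _, x, e1, e2 = e
        l1 = next(fresh)                      # l' fresh
        s1, t1 = traced_eval(store, l1, e1, env, fresh)
        s2, t2 = traced_eval(s1, dest, e2, {**env, x: l1}, fresh)
        return s2, ("seq", t1, t2)            # T1; T2
    if tag == "if":
        _, w, e_true, e_false = e
        test = env.get(w, w)
        b = store[test]
        s1, t = traced_eval(store, dest, e_true if b else e_false, env, fresh)
        return s1, ("cond", dest, test, b, t)
    # Otherwise e is a term: resolve variables to labels, then assign.
    t = (tag,) + tuple(env.get(a, a) if isinstance(a, str) else a
                       for a in e[1:])
    return {**store, dest: op(t, store)}, ("assign", dest, t)

def result_label(trace):
    """out(T) for the tuple encoding of traces used here."""
    return result_label(trace[2]) if trace[0] == "seq" else trace[1]

# if x = 5 then y + 42 else x, in A-normal form
example = ("let", "c5", ("int", 5),
           ("let", "b", ("eq", "x", "c5"),
            ("if", "b",
             ("let", "c42", ("int", 42), ("add", "y", "c42")),
             ("label", "x"))))

fresh = (f"t{i}" for i in itertools.count())
final_store, trace = traced_eval({"lx": 5, "ly": 42}, "out", example,
                                 {"x": "lx", "y": "ly"}, fresh)
```

Because the sketch works on the A-normalized expression, its trace also contains assignment steps for the constants 5 and 42. The result label of the trace is the destination `out`, in line with the property that the trace of an evaluation at $l$ has result label $l$.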

Example 3 Figure 9 shows one possible trace resulting from
normalizing and running query $Q_1$ from Example 1 on the data in
Figure 1. Similarly, Figure 10 shows a possible trace of the
grouping-aggregation query $Q_2$ from Example 2. Since the example queries
use record syntax, we use terms such as $(A : l)$ and traces
$l \leftarrow \mathrm{proj}_A(l', l'')$ for record construction and field projection. These
operations are natural generalizations of pair terms and projection
traces. For brevity, the examples omit expression annotations.

We will need the following property:

Lemma 1. If $\sigma, l \Leftarrow e \Downarrow \sigma', T$ then $\mathrm{out}(T) = l$.

Proof. Easy induction on derivations. □

4. Provenance extraction 

As we discussed in Section 1, a number of forms of provenance
have been defined already in the literature. Although most of this 
work has focused on flat relational queries, several techniques have 



l <- comp(r,{
  [r1] x11 <- proj_C(r1,r13); x1 <- comp(s,{
    [s1] x111 <- proj_C(s1,s11); x112 <- x11 = x111;
         cond(x112,f,x113 <- {}),
    [s2] x121 <- proj_C(s2,s21); x122 <- x11 = x121;
         cond(x122,f,x123 <- {}),
    [s3] x131 <- proj_C(s3,s31); x132 <- x11 = x131;
         cond(x132,t,l11 <- proj_A(r1,r11);
                     l12 <- proj_B(r1,r12);
                     l13 <- proj_D(s3,s32);
                     l1 <- (A:l11,B:l12,D:l13);
                     x136 <- {l1})}),
  [r2] x21 <- proj_C(r2,r23); x2 <- comp(s,{
    [s1] x211 <- proj_C(s1,s11); x212 <- x21 = x211;
         cond(x212,f,x213 <- {}),
    [s2] x221 <- proj_C(s2,s21); x222 <- x21 = x221;
         cond(x222,f,x223 <- {}),
    [s3] x231 <- proj_C(s3,s31); x232 <- x21 = x231;
         cond(x232,t,l21 <- proj_A(r2,r21);
                     l22 <- proj_B(r2,r22);
                     l23 <- proj_D(s3,s32);
                     l2 <- (A:l21,B:l22,D:l23);
                     x126 <- {l2})}),
  [r3] x31 <- proj_C(r3,r33); x3 <- comp(s,{
    [s1] x311 <- proj_C(s1,s11); x312 <- x31 = x311;
         cond(x312,f,x313 <- {}),
    [s2] x321 <- proj_C(s2,s21); x322 <- x31 = x321;
         cond(x322,f,x323 <- {}),
    [s3] x331 <- proj_C(s3,s31); x332 <- x31 = x331;
         cond(x332,f,x333 <- {})})})



Figure 9. Example trace for query $Q_1$



l11' <- 42; x1 <- 2;
l12' <- sum(s,{
  [s1] x11 <- proj_C(s1,s11); x12 <- x11 = x1;
       cond(x12,t, x13 <- proj_D(s1,s12)),
  [s2] x21 <- proj_C(s2,s21); x22 <- x21 = x1;
       cond(x22,t, x23 <- proj_D(s2,s22)),
  [s3] x31 <- proj_C(s3,s31); x32 <- x31 = x1;
       cond(x32,f, x33 <- 0)});
l1' <- (C:l11',D:l12'); x <- {l1'}; y1 <- 4;
y <- comp(r,{
  [r1] y11 <- proj_C(r1,r13); y12 <- y11 = y1;
       cond(y12,f, y13 <- {}),
  [r2] y21 <- proj_C(r2,r21); y22 <- y21 = y1;
       cond(y22,f, y23 <- {}),
  [r3] y31 <- proj_C(r3,r31); y32 <- y31 = y1;
       cond(y32,t,l21' <- proj_B(r3,r32);
                  l22' <- proj_A(r3,r31);
                  l2' <- (C:l21',D:l22');
                  y33 <- {l2'})});
l' <- x U y



Figure 10. Example trace for query Q2 
recently been extended to the NRC. Thus, a natural question is: are
traces related to these other forms of provenance? 

In this section we describe algorithms for extracting
where-provenance (Buneman et al. 2007), dependency provenance
(Cheney et al. 2007), and semiring provenance (Foster et al. 2008)
from traces.
We will develop extraction algorithms and prove them correct 
relative to the existing definitions. However, our operational for- 
mulation of traces is rather different from existing denotational 
presentations of provenance semantics, so we need to set up ap- 
propriate correspondences between store-based and value-based 
representations. Precisely formulating these equivalences requires 
introducing several auxiliary definitions and properties. 

We also discuss how provenance extraction yields insight into 
the meaning of other forms of provenance. We can view the extrac- 
tion algorithms as dynamic analyses of the provenance trace. For 
example, where-provenance can be viewed as an analysis that identifies
"chains of copies" from the input to the output. Conversely, we
can view high-level properties of traces as clear specifications that 
can be used to justify new provenance-tracking techniques. 

The fact that several distinct forms of provenance can all be ex- 
tracted from traces is a clear qualitative indication that traces are 
very general. This generality is not surprising in light of the fidelity 
property, which essentially requires that the traces accurately repre- 
sent the query in all inputs. In fact, the provenance extraction rules 
do not inspect the expression annotations $x.e$, $e_1$, $e_2$ in
comprehension and conditional traces; thus, they all work correctly
even without these annotations. Also, the extraction rules do not have
access to the underlying store $\sigma$, nor do they need to
reconstruct the intermediate store. The trace itself records enough
information about
the store labels actually accessed. 

We first fix some terminology used in the rest of the section.
We consider an annotated store $\sigma^{(h)}$ to consist of a store
$\sigma$ and a function $h : \mathrm{dom}(\sigma) \to A$ assigning
each label in $\sigma$ to an annotation in $A$. We also consider
several kinds of annotated values. In general, a value
$v \in Val^{(A)}$ with annotations $a$ from some set $A$ is an
expression of the form

$$v ::= w^a \qquad w ::= i \mid b \mid (v_1, v_2) \mid \{v_1, \ldots, v_n\}$$


This syntax strictly generalizes that of ordinary values since
ordinary values can be viewed as values annotated by elements of some
unit set $\{\star\}$, up to an obvious isomorphism. Also, we write
$|v|$ for the ordinary value obtained by erasing the annotations from
$v$. This is defined as:

$$|i^a| = i \qquad |b^a| = b \qquad |(v_1, v_2)^a| = (|v_1|, |v_2|)$$
$$|\{v_1, \ldots, v_n\}^a| = \{|v_1|, \ldots, |v_n|\}$$



Moreover, we define \ w^\ = w and [ui^] = x. 

Given an $A$-annotated store $\sigma^{(h)}$, we can extract annotated
values using the same technique as extracting ordinary values from
an ordinary store:



(h) 



h{i) 



Moreover, for $\gamma : Var \to Lab$ we again write
$\sigma^{(h)} \uparrow^{A}_{\Gamma} \gamma : Var \to Val^{(A)}$ for the
extension of the annotated value extraction function from labels to
environments. Similarly, for $L$ a collection of labels



CtC^),/ ^ t i^w := t](h[i: = whcro{t,h)l) 

tTC^^Z' ^ ei J^w 't'C'') a'C*'),/ e2[r/x] f^"'''"' «' fresh 
crC*),/ ^ let X = ei in 62 ij-w o-"(^") 
ail') = ih,l2) 



a{l') = b o-C'),^ ^ 66 o-'C^') 
o-C") ,1 ^\fl' then et else ef ij-w cr'^'^'^ 

aWj^ U{e I X g I'} ii-w := \_\ a' [L']]Wl-- = ^) 



aW,x G 0,e J>;^ (t(^), 



a; G Li,e^^ a]^^',L[ , x Li, e i^^^, a. 



(hitiJfths) 



L'l ® L'2 



o-C"),/' ^ e[l/x] i^w o-'''''^ fresh 
o-(''),x G {I : m},e^^ u"-^'\{V : m} 



we write $\sigma^{(h)} \uparrow^{A}_{\{\tau\}} L$ for
$\{\sigma^{(h)} \uparrow^{A}_{\tau} l : m \mid l : m \in L\}$.



Figure 12. Where-provenance, operationally 



4.1 Where-provenance 

As discussed by Buneman et al. (2001, 2007), where-provenance
is information about "where an output value came from in the
input". Buneman et al. (2007) defined where-provenance semantics
for NRC queries via values annotated with optional annotations
$A_\bot = A \uplus \{\bot\}$. Here, $\bot$ stands for the absence of
where-provenance, and $A$ is a set of tokens chosen to uniquely
address each part of the input.

The idea of where-provenance is that values "copied" via variable
or projection expressions retain their annotations, while other
operations produce results annotated with $\bot$. We use an auxiliary
function

$$\mathrm{where}(l, h) = h(l)$$
$$\mathrm{where}(t, h) = \bot \quad (t \neq l)$$

that defines the annotation of the result of a term $t$ with respect
to $h : Lab \to A_\bot$ to be preserved if $t = l$ and otherwise
$\bot$. Buneman et al. did not consider integer operations or sums;
we support them by annotating the results with $\bot$.

We first review the denotational presentation of where-provenance
from (Buneman et al. 2007). Figure 11 shows the semantics of
expressions $e$ as a function $\mathcal{W}[\![e]\!]$ mapping contexts
$\gamma : Var \to Val^{(A_\bot)}$ to $A_\bot$-annotated values.

In Figure 12 we introduce an equivalent operational formulation.
We define judgments
$\sigma^{(h)}, l \Leftarrow e \Downarrow_W \sigma'^{(h')}$ for
expression evaluation and
$\sigma^{(h)}, x \in L, e \Downarrow^*_W \sigma'^{(h')}, L'$ for
iteration, both with where-provenance propagation.

It is straightforward to prove by induction that: 

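Theorem 2.

1. Suppose $\Gamma \vdash e : \tau$ and $\Psi \vdash \sigma, \gamma : \Gamma$.
Then $\sigma^{(h)}, l \Leftarrow \gamma(e) \Downarrow_W \sigma'^{(h')}$
if and only if
$\mathcal{W}[\![e]\!](\sigma^{(h)} \uparrow^{A_\bot}_{\Gamma} \gamma) = \sigma'^{(h')} \uparrow^{A_\bot}_{\tau} l$.

2. Suppose $\Gamma, x : \tau \vdash e : \{\tau'\}$ and
$\Psi \vdash \sigma, \gamma : \Gamma$. Then
$\sigma^{(h)}, x \in L, \gamma(e) \Downarrow^*_W \sigma'^{(h')}, L'$
if and only if
$\{\mathcal{W}[\![e]\!]\gamma[x := v] \mid v \in \sigma^{(h)} \uparrow^{A_\bot}_{\{\tau\}} L\} = \sigma'^{(h')} \uparrow^{A_\bot}_{\{\tau'\}} L'$.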
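$$h,\ l \leftarrow t \rightsquigarrow_W h[l := \mathrm{where}(t, h)]
\qquad
\frac{h, T_1 \rightsquigarrow_W h' \quad h', T_2 \rightsquigarrow_W h''}{h,\ T_1; T_2 \rightsquigarrow_W h''}$$

$$h,\ l \leftarrow \mathsf{proj}_i(l', l'') \rightsquigarrow_W h[l := h(l'')]
\qquad
\frac{h, T \rightsquigarrow_W h'}{h,\ \mathsf{cond}_l(l', b, T) \rightsquigarrow_W h'}$$

$$\frac{h, \Theta \rightsquigarrow^*_W h'}{h,\ l \leftarrow \mathsf{comp}(l', \Theta) \rightsquigarrow_W h'[l := \bot]}
\qquad
\frac{h, \Theta \rightsquigarrow^*_W h'}{h,\ l \leftarrow \mathsf{sum}(l', \Theta) \rightsquigarrow_W h'[l := \bot]}$$

$$\frac{h, \Theta_1 \rightsquigarrow^*_W h_1 \quad h, \Theta_2 \rightsquigarrow^*_W h_2}{h,\ \Theta_1 \oplus \Theta_2 \rightsquigarrow^*_W h_1 \uplus_h h_2}
\qquad
\frac{h, T \rightsquigarrow_W h'}{h,\ \{[l]T : m\} \rightsquigarrow^*_W h'}$$

Figure 13. Extracting where-provenance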
Output table $Q_1(A, B, D)$    Output table $Q_2(C, D)$
Figure 14. Where-provenance extraction examples 



The where-provenance extraction relation is shown in Figure 13.
We define the judgment $h, T \rightsquigarrow_W h'$, which takes input
annotations $h$ and propagates them through $T$ to yield output
annotations $h'$, and the judgment $h, \Theta \rightsquigarrow^*_W h'$,
which propagates annotations through a set of traces.
Where-provenance extraction can be shown correct relative to the
operational where-provenance semantics, as follows:

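Theorem 3.

1. Suppose $\sigma, l \Leftarrow e \Downarrow \sigma', T$ and
$h : \mathrm{dom}(\sigma) \to A_\bot$ is given. Then
$\sigma^{(h)}, l \Leftarrow e \Downarrow_W \sigma'^{(h')}$ holds if and
only if $h, T \rightsquigarrow_W h'$ holds.

2. If $\sigma, x \in L, e \Downarrow^* \sigma', L', \Theta$, then
$\sigma^{(h)}, x \in L, e \Downarrow^*_W \sigma'^{(h')}, L'$ if and
only if $h, \Theta \rightsquigarrow^*_W h'$.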

Example 4 Figure 14 shows the results of where-provenance
extraction for Examples 1 and 2. For the inputs and results in
Figure 1, the field values copied from the input have provenance
links to their sources, whereas values computed from several values
have no where-provenance ($\bot$).

Definition 2 A copy with source $l'$ and target $l$ is a trace of
either the form $l \leftarrow l'$ or
$l \leftarrow \mathsf{proj}_i(l'', l')$. A chain of copies from $l_0$
to $l_n$ is a sequence of trace steps $T_1; \ldots; T_n$ where each
step $T_i$ is a copy from $l_{i-1}$ to $l_i$. We say that a trace $T$
contains a chain of copies from $l'$ to $l$ if there is a chain of
copies from $l'$ to $l$ all of whose operations are present in $T$.

Let $\mathrm{id}_\sigma : \mathrm{dom}(\sigma) \to \mathrm{dom}(\sigma)_\bot$
be the (lifted) identity function on $\sigma$.



Proposition 4. Suppose $\sigma, l \Leftarrow e \Downarrow \sigma', T$
and $\mathrm{id}_\sigma, T \rightsquigarrow_W h$. Then for each
$l' \in \mathrm{dom}(\sigma')$, $h(l') \neq \bot$ if and only if there
is a chain of copies from $h(l')$ to $l'$ in $T$.

Moreover, where-provenance can easily be extracted from a 
trace for a single input or output label rather than for all of the 
labels simultaneously, simply by traversing the trace. Though this 
takes time $O(|T|)$ in the worst case, we could do much better if the
traces are represented as graphs rather than as syntax trees. 
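
The traversal described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the
tuple encoding of traces (`"assign"`, `"proj"`, `"seq"`, `"cond"`,
`"comp"`, `"sum"`), the use of `None` for $\bot$, and the names
`where` and `extract_where` are hypothetical choices, not notation
from the paper. Trace sets are simplified to lists of (label, trace)
pairs, multiplicities are ignored, and the merge of the per-element
results assumes that element traces write to distinct fresh labels.

```python
BOTTOM = None  # plays the role of the "no where-provenance" annotation

def where(term, h):
    """where(l, h) = h(l); any other basic term gets BOTTOM."""
    if term[0] == "label":
        return h.get(term[1], BOTTOM)
    return BOTTOM

def extract_where(h, trace):
    """Propagate the annotation map h through one trace; returns a new map."""
    kind = trace[0]
    if kind == "assign":            # l <- t
        _, l, term = trace
        return {**h, l: where(term, h)}
    if kind == "proj":              # l <- proj_i(l', l'')
        _, l, _pair, field = trace
        return {**h, l: h.get(field, BOTTOM)}
    if kind == "seq":               # T1 ; T2
        _, t1, t2 = trace
        return extract_where(extract_where(h, t1), t2)
    if kind == "cond":              # cond_l(l', b, T)
        _, _l, _test, _b, body = trace
        return extract_where(h, body)
    if kind in ("comp", "sum"):     # l <- comp(l', Theta) or l <- sum(l', Theta)
        _, l, _src, theta = trace
        merged = dict(h)
        for _elem, body in theta:   # every element trace starts from h
            merged.update(extract_where(h, body))
        return {**merged, l: BOTTOM}
    raise ValueError(f"unknown trace form: {kind}")

# A fragment in the style of Figure 9: copy a field, copy it again, then add.
h0 = {"r1": "r1", "r11": "r11", "r12": "r12"}   # identity on the input labels
trace = ("seq", ("proj", "l11", "r1", "r11"),
         ("seq", ("assign", "c", ("label", "l11")),
                 ("assign", "s", ("add", "r11", "r12"))))
h1 = extract_where(h0, trace)
```

Here `h1["l11"]` and `h1["c"]` both point back to `r11`, reflecting a
chain of copies, while the computed value `s` carries no
where-provenance.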



4.2 Dependency provenance 

We next consider extracting the dependency provenance introduced
in our previous work (Cheney et al. 2007). Dependency provenance
is motivated by the concepts of dependency that underlie program
slicing (Venkatesh 1991) and noninterference in information flow
security, as formalized, for instance, in the Dependency Core
Calculus (Abadi et al. 1999). We consider NRC values annotated with
sets of tokens and define an annotation-propagating semantics.

Dependency provenance annotations are viewed as correct
when they link each part of the input to all parts of the output
that may change if the input part is changed. This is similar to
non-interference. The resulting links can be used to "slice" the
input with respect to the output and vice versa. Cheney et al. (2007)
established that, as with minimal program slices, minimal dependency
provenance is not computable, but gave dynamic and static
approximations. Here, we will show how to extract the dynamic
approximation from traces.

Dependency provenance can be modeled using values
$v \in Val^{(\mathcal{P}(A))}$ annotated with sets of tokens from $A$.
We introduce an auxiliary function $\mathrm{dep}(t, h)$ for
calculating the dependences of basic terms $t$ relative to annotation
functions $h : Lab \to \mathcal{P}(A)$.

$$\mathrm{dep}(i, h) = \mathrm{dep}(b, h) = \mathrm{dep}(\emptyset, h) = \emptyset$$
$$\mathrm{dep}(\{l\}, h) = \mathrm{dep}(\neg l, h) = \mathrm{dep}(l, h) = h(l)$$
$$\mathrm{dep}(\mathrm{empty}(l), h) = h(l)$$
$$\mathrm{dep}(l_1 + l_2, h) = \mathrm{dep}(l_1 \approx l_2, h) = h(l_1) \cup h(l_2)$$
$$\mathrm{dep}(l_1 \wedge l_2, h) = \mathrm{dep}((l_1, l_2), h) = h(l_1) \cup h(l_2)$$
$$\mathrm{dep}(l_1 \cup l_2, h) = h(l_1) \cup h(l_2)$$

Essentially, dep simply takes the union of the annotations of all
labels mentioned in a term.

Cheney et al. (2007) defined dynamic provenance-tracking
denotationally as a function $\mathcal{D}[\![e]\!]$ mapping contexts
$\gamma : Var \to Val^{(\mathcal{P}(A))}$ to
$\mathcal{P}(A)$-annotated values. We present this definition in
Figure 15. Note that we use an auxiliary notation $v^{+a}$ to indicate
adding an annotation to the toplevel of a $\mathcal{P}(A)$-annotated
value. That is, $(w^{b})^{+a} = w^{b \cup a}$.

Next we introduce an operational version. We define judgments
$\sigma^{(h)}, l \Leftarrow e \Downarrow_D \sigma'^{(h')}$ for
expression evaluation and
$\sigma^{(h)}, x \in L, e \Downarrow^*_D \sigma'^{(h')}, L'^{(a)}$ for
comprehension evaluation, both with dependency-provenance
propagation. Note that the iteration rules maintain an annotation set
$a$ collecting the top-level annotations of the elements of $L'$.

It is straightforward to prove by induction that: 

Theorem 4.

1. Suppose F h e : r and <!/ h cr, 7 : F. Then a^^\ I ^ e i}-D 
a'C^') ifandonlyifDiejiaC^^ Tr*'*' 7) = Tr*-*' I- 

2. Suppose r,a; : r h e : {r'} and h cr, 7 : F. Then 
a^''\x e L,e cr'''''' , L''") if and only if {Dlel'y[x ~ 

I' e L'}. 



^1^,^.') L'anda = U{ain\ 



We define the dependency-provenance extraction judgments
$h, T \rightsquigarrow_D h'$ and
$h, \Theta \rightsquigarrow^*_D h'^{(a)}$ in Figure 18. As usual, we
have two judgments, one for traversing traces and another for
traversing trace sets.
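
As a rough illustration of how these judgments could be executed, the
following Python sketch propagates dependency sets through a trace.
It is an assumption-laden sketch rather than the paper's definition:
traces are encoded as tuples, labels are strings, annotations are
`frozenset`s of tokens, trace sets are lists of (label, trace) pairs
with multiplicities ignored, and the names `dep`, `out` and
`extract_dep` are hypothetical.

```python
EMPTY = frozenset()

def dep(term, h):
    """Union of the annotations of every label mentioned in a basic term."""
    _op, *args = term
    result = EMPTY
    for a in args:
        if isinstance(a, str):      # labels are strings; constants are not
            result |= h.get(a, EMPTY)
    return result

def out(trace):
    """Output label of a trace; for T1;T2 it is the output of T2."""
    return out(trace[2]) if trace[0] == "seq" else trace[1]

def extract_dep(h, trace):
    """Propagate dependency annotations h through one trace."""
    kind = trace[0]
    if kind == "assign":                        # l <- t
        _, l, term = trace
        return {**h, l: dep(term, h)}
    if kind == "proj":                          # l <- proj_i(l', l_i)
        _, l, pair, field = trace
        return {**h, l: h.get(pair, EMPTY) | h.get(field, EMPTY)}
    if kind == "seq":                           # T1 ; T2
        return extract_dep(extract_dep(h, trace[1]), trace[2])
    if kind == "cond":                          # cond_l(l', b, T)
        _, l, test, _b, body = trace
        h2 = extract_dep(h, body)
        return {**h2, l: h2.get(test, EMPTY) | h2.get(l, EMPTY)}
    if kind in ("comp", "sum"):                 # l <- comp/sum(l', Theta)
        _, l, src, theta = trace
        h2, a = dict(h), EMPTY
        for _elem, body in theta:               # each element starts from h
            hb = extract_dep(h, body)
            h2.update(hb)
            a |= hb.get(out(body), EMPTY)       # collect top-level annotations
        return {**h2, l: h2.get(src, EMPTY) | a}
    raise ValueError(f"unknown trace form: {kind}")

h0 = {"lx": frozenset({"x"}), "ly": frozenset({"y"}), "lz": frozenset({"z"})}
t = ("cond", "l", "ly", True, ("assign", "l", ("add", "lx", "lz")))
h1 = extract_dep(h0, t)
```

In this small run the result label `l` depends on all three inputs:
on `lx` and `lz` through the addition, and on `ly` through the
conditional test.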



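Theorem 5.

1. Suppose $\sigma, l \Leftarrow e \Downarrow \sigma', T$ and
$h : \mathrm{dom}(\sigma) \to \mathcal{P}(A)$. Then
$\sigma^{(h)}, l \Leftarrow e \Downarrow_D \sigma'^{(h')}$ holds if and
only if $h, T \rightsquigarrow_D h'$ holds.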
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l^Iei « 6217 




r if [Wle^hl = LW^Ie2l7j 
\ if LM/Ieil7j / LW^Ie2l7j 


M^|empty(e)|7 




r if [Wlehi = 
1 if [Wlehi ^ 



Figure 11. Where-provenance, denotationally 
Figure 15. Dependency-provenance, denotationally
$$\sigma^{(h)}, l \Leftarrow t \Downarrow_D \sigma[l := t]^{(h[l := \mathrm{dep}(t, h)])}$$

^(h)^;' ^ ei JJ,i3 a'C"') ^ ezl/'/x] JJ-D fr"*''"' ('fresh 

crC'),/ ^ let X = ei in e2 ii-n cr"'''") 

try) = {h,h) 

aWj ^ ^^(i') ^[i — ^(;^)](h[i: = h(I,)uh(i')l) 

cr{l') = b aWj^ebi^pa"-''''! 

rjWj^ \fl' then et else ef ^i, f^'{h' ll:=h' (t)uh' (l')]) 

(tC^),/ ^ U{e I a; G ('} := □f7'[L']]('''l'^='''('')^''l) 

crC"),!- e (7(Z),e ^J, <t'('''),L'(") 

o-C"),/' ^ e[V2:] -Hd o-'C^') i' fresh 
(T(''),a; G {/ : mj.e^Jj ct'C^'),!;' : m}('''('')) 

Figure 16. Dependency-provenance, operationally 

$A_1 = \{r, s, r_1, r_2, r_3, s_1, s_2, s_3, r_{13}, s_{11}, r_{23}, s_{21}, r_{33}, s_{31}\}$

$A_2 = \{r, s, r_1, r_2, r_3, s_1, s_2, s_3, r_{12}, r_{22}, r_{32}\}$
$A_3 = \{s_{11}, s_{12}, s_{21}, s_{22}, s_{31}\}$
Output table $Q_1(A, B, D)$    Output table $Q_2(C, D)$



Figure 17. Dependency provenance extraction examples 

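2. If $\sigma, x \in L, e \Downarrow^* \sigma', L', \Theta$ and
$h : \mathrm{dom}(\sigma) \to \mathcal{P}(A)$ then
$\sigma^{(h)}, x \in L, e \Downarrow^*_D \sigma'^{(h')}, L'^{(a)}$ holds
if and only if $h, \Theta \rightsquigarrow^*_D h'^{(a)}$ holds.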

Example 5 Figure 17 shows the results of dependency provenance
extraction for Examples 1 and 2. The dependency-provenance is
similar to the where-provenance for several fields such as $l_{11}$.
The rows $l'_1, l'_2$ have no (immediate) dependences. The top-level
labels $l, l'$ depend on many parts of the input: essentially, on all
parts at which changes could lead to global changes to the output
table.

4.3 Semiring provenance 

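Green et al. (2007) introduced the semiring-annotated relational
model. Recall that a (commutative) semiring is an algebraic structure
$(K, 0_K, 1_K, +_K, \cdot_K)$ such that $(K, 0_K, +_K)$ and
$(K, 1_K, \cdot_K)$ are commutative monoids, $0_K$ is an annihilator
(that is, $0_K \cdot_K x = 0_K = x \cdot_K 0_K$) and $\cdot_K$
distributes over $+_K$. They considered $K$-relations to be ordinary
finite relations whose elements are annotated with elements of $K$,
and interpreted relational calculus queries over $K$-relations such
that many known variations of the relational model are a special
case. For example, ordinary set-based semantics corresponds to the
semiring $(\mathbb{B}, f, t, \vee, \wedge)$, whereas the multiset or
bag semantics corresponds to the semiring
$(\mathbb{N}, 0, 1, +, \cdot)$.

$$h,\ l \leftarrow t \rightsquigarrow_D h[l := \mathrm{dep}(t, h)]
\qquad
\frac{h, T_1 \rightsquigarrow_D h' \quad h', T_2 \rightsquigarrow_D h''}{h,\ T_1; T_2 \rightsquigarrow_D h''}$$

$$h,\ l \leftarrow \mathsf{proj}_i(l', l_i) \rightsquigarrow_D h[l := h(l') \cup h(l_i)]
\qquad
\frac{h, T \rightsquigarrow_D h'}{h,\ \mathsf{cond}_l(l', b, T) \rightsquigarrow_D h'[l := h'(l') \cup h'(l)]}$$

$$\frac{h, \Theta \rightsquigarrow^*_D h'^{(a)}}{h,\ l \leftarrow \mathsf{comp}(l', \Theta) \rightsquigarrow_D h'[l := h'(l') \cup a]}
\qquad
\frac{h, \Theta \rightsquigarrow^*_D h'^{(a)}}{h,\ l \leftarrow \mathsf{sum}(l', \Theta) \rightsquigarrow_D h'[l := h'(l') \cup a]}$$

$$h, \emptyset \rightsquigarrow^*_D h^{(\emptyset)}
\qquad
\frac{h, T \rightsquigarrow_D h'}{h,\ \{[l]T\} \rightsquigarrow^*_D h'^{(h'(\mathrm{out}(T)))}}$$

$$\frac{h, \Theta_1 \rightsquigarrow^*_D h_1^{(a_1)} \quad h, \Theta_2 \rightsquigarrow^*_D h_2^{(a_2)}}{h,\ \Theta_1 \oplus \Theta_2 \rightsquigarrow^*_D (h_1 \uplus_h h_2)^{(a_1 \cup a_2)}}$$

Figure 18. Extracting dependency provenance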

The most general instance of the $K$-relational model is obtained
by taking $K$ to be the free semiring $\mathbb{N}[X]$ of polynomials
with coefficients in $\mathbb{N}$ over indeterminates $X$, and
Green et al. (2007) considered this to yield a form of provenance
that they called how-provenance because it provides more information
(than previous approaches such as why-provenance or lineage) about
how a tuple was derived from the input. Lineage and why-provenance
can also be obtained as instances of the semiring model (although the
initial paper glossed over some subtleties that were later clarified
by (Buneman et al. 2008)). Thus, if we can extract semiring
provenance from traces, we can also extract lineage and
why-provenance.

Foster et al. (2008) extended the semiring-valued model to the
NRC, and we will work in terms of this version. Formally, given
semiring $K$, Foster et al. (2008) interpret types as follows:

$$K[\![\mathsf{int}]\!] = \mathbb{Z} \qquad K[\![\mathsf{bool}]\!] = \mathbb{B}$$
$$K[\![\tau_1 \times \tau_2]\!] = K[\![\tau_1]\!] \times K[\![\tau_2]\!]$$
$$K[\![\{\tau\}]\!] = \{f : K[\![\tau]\!] \to K \mid \mathrm{supp}(f) \text{ finite}\}$$

where $\mathrm{supp}(f) = \{x \in X \mid f(x) \neq 0_K\}$ provided
$f : X \to K$. In other words, integer, boolean and pair types are
interpreted normally, and collections of type $\tau$ are interpreted
as finitely-supported functions from $K[\![\tau]\!]$ to $K$. For
example, finitely-supported functions $X \to \mathbb{B}$ correspond
to finite relations over $X$, whereas finitely-supported functions
$X \to \mathbb{N}$ correspond to finite multisets. We overload the
multiset notation $\{v_1 : k_1, \ldots\}$ for $K$-collections over
$K$-values $v$ to indicate that the annotation of $v_i$ is $k_i$. We
write $K$-$Val$ for the set of all $K$-values of any type.

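We write $\mathcal{K}(X)$ for
$\{f : X \to K \mid \mathrm{supp}(f) \text{ finite}\}$. This forms
an additive monad with zero. To simplify notation, we define its
"return" ($\eta_{\mathcal{K}}$), "bind" ($\bullet_{\mathcal{K}}$),
zero ($0_{\mathcal{K}}$), and addition ($+_{\mathcal{K}}$) operators
as follows:

$$\eta_{\mathcal{K}}(x) = \lambda y.\ \text{if } x = y \text{ then } 1_K \text{ else } 0_K$$
$$0_{\mathcal{K}} = \lambda x.\ 0_K$$
$$f +_{\mathcal{K}} g = \lambda x.\ f(x) +_K g(x)$$

Moreover, if $f : X \to K$ and $k \in K$ then we write $k \cdot f$ for
the "scalar multiplication" of $f$ by $k$, that is,
$k \cdot f = \lambda x.\ k \cdot_K f(x)$.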
$$\mathcal{K}[\![x]\!]\gamma = \gamma(x)$$
$$\mathcal{K}[\![\mathsf{let}\ x = e_1\ \mathsf{in}\ e_2]\!]\gamma = \mathcal{K}[\![e_2]\!]\gamma[x \mapsto \mathcal{K}[\![e_1]\!]\gamma]$$

Figure 19. Semiring provenance, denotationally

$$\sigma^{(h)}, l \Leftarrow t \Downarrow_K \sigma[l := t]^{(h[l := \mathrm{semiring}(t, h)])}$$

Figure 20. Semiring provenance, operationally



Foster et al. (2008) defined the semantics of NRC over $K$-values
denotationally. Figure 19 presents a simplified version of this
semantics in terms of the $\mathcal{K}$ monad operations; we interpret
an expression $e$ as a function from environments
$\gamma : Var \to K$-$Val$ to results in $K$-$Val$. Note that the
version of NRC in Foster et al. (2008) excludes emptiness tests,
integers, booleans and primitive operations other than equality, but
also includes some features we do not consider such as a tree type
used to model unordered XML. Most of
the rules are similar to the ordinary denotational semantics of NRC; 
only the rules involving collection types are different. A suitable 
type soundness theorem can be shown easily for this interpretation. 
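
To make the $\mathcal{K}$ monad operations concrete, here is a small
Python sketch of $K$-collections as dictionaries from elements to
annotations. Everything named here (`Semiring`, `BOOL`, `NAT`, `ret`,
`zero`, `plus`, `scale`, `bind`) is a hypothetical encoding chosen for
illustration. In particular, `bind` uses the usual definition for
this kind of monad (sum over $x$ of $f(x)$ times $g(x)$), which is an
assumption on our part rather than a formula taken from the text
above. Elements are assumed to be hashable.

```python
from collections import namedtuple

# A commutative semiring given by its zero, one, addition and multiplication.
Semiring = namedtuple("Semiring", "zero one add mul")
BOOL = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)
NAT = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)

def normalize(K, f):
    """Drop zero annotations so only the finite support is stored."""
    return {x: k for x, k in f.items() if k != K.zero}

def ret(K, x):
    """Return: the collection containing x with annotation 1_K."""
    return {x: K.one}

def zero(K):
    """The empty collection (everything annotated 0_K)."""
    return {}

def plus(K, f, g):
    """Pointwise addition of annotations."""
    result = dict(f)
    for x, k in g.items():
        result[x] = K.add(result.get(x, K.zero), k)
    return normalize(K, result)

def scale(K, k, f):
    """Scalar multiplication k . f."""
    return normalize(K, {x: K.mul(k, v) for x, v in f.items()})

def bind(K, f, g):
    """Assumed standard bind: sum over x in supp(f) of f(x) . g(x)."""
    result = zero(K)
    for x, k in f.items():
        result = plus(K, result, scale(K, k, g(x)))
    return result

bag = {"a": 2, "b": 1}                               # a multiset over NAT
flat = bind(NAT, bag, lambda x: {x.upper(): 1, "Z": 1})
```

With `NAT` the result `flat` counts `"Z"` three times (twice via `"a"`
and once via `"b"`); with `BOOL` the same operations behave like
ordinary set union and flattening.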

Semiring-valued relations place annotations only on the elements
of collections. To model these annotations correctly using stores,
we annotate labels of collections with $K$-collections of labels
$\mathcal{K}(Lab)$. As a simple example, consider store
$[l_1 := 1, l_2 := 2, l_3 := 1, l := \{l_1 : 2, l_2 : 3, l_3\}]$ and
annotation function $h(l) = [l_1 := k_1, l_2 := k_2, l_3 := k_3]$.
Then $l$ can be interpreted as the $K$-value
$\{1 : 2k_1 + k_3, 2 : 3k_2\}$. The reason for annotating collections
with $\mathcal{K}(Lab)$ instead of annotating collection element
labels directly is that due to sharing, a label may be an element of
more than one collection in a store (with different $K$-annotations).
For example, consider
$[l_1 := 1, l_2 := 2, l := \{l_1 : 2, l_2\}, l' := \{l_1 : 42\}]$.
If we annotate $l$ with $[l_1 := k_1, l_2 := k_2]$ and $l'$ with
$[l_1 := k_3]$ then we can interpret $l$ as $\{1 : 2k_1, 2 : k_2\}$
and $l'$ as $\{1 : 42k_3\}$ respectively. If the annotations were
placed directly on $l_1, l_2$ then this would not be possible.

We will consider annotation functions
$h : Lab \to \mathcal{K}(Lab)_\bot$ such that if $l$ is the label of a
collection, then $h(l)$ maps the elements of $l$ to their values.
Labels of pair, integer, or boolean constructors are mapped to
$\bot$. In what follows, we will use an auxiliary function
$\mathrm{semiring}(t, h)$ to deal with the basic operations:

$$\mathrm{semiring}(l, h) = h(l)$$
$$\mathrm{semiring}(\emptyset, h) = 0_{\mathcal{K}}$$
$$\mathrm{semiring}(\{l\}, h) = \eta_{\mathcal{K}}(l)$$
$$\mathrm{semiring}(l_1 \cup l_2, h) = h(l_1) +_{\mathcal{K}} h(l_2)$$
$$\mathrm{semiring}(t, h) = \bot \quad \text{(otherwise)}$$

As before, we consider an operational version of the denotational
semantics of NRC over $K$-values. This is shown in Figure 20. As
usual, there are two judgments, one for expression evaluation and one
for iterating over a set. Many of the rules not involving collections
are standard. The semiring function handles the cases for
$\emptyset$, $\cup$, and $\{e\}$.

There is a mismatch between the denotational semantics on
$K$-values and the operational semantics. The latter produces
annotated stores, and we need to translate these to $K$-values in
order to be able to relate the denotational and operational
semantics. The desired translation is different from the ones we have
needed so far. We define

$$\sigma^{(h)} \uparrow^{K}_{b} l = \sigma(l) \quad (b \in \{\mathsf{int}, \mathsf{bool}\})$$
$$\sigma^{(h)} \uparrow^{K}_{\tau_1 \times \tau_2} l = (\sigma^{(h)} \uparrow^{K}_{\tau_1} l_1,\ \sigma^{(h)} \uparrow^{K}_{\tau_2} l_2) \quad (\sigma(l) = (l_1, l_2))$$
$$\sigma^{(h)} \uparrow^{K}_{\{\tau\}} l = \lambda x.\ \sum \{h(l)(l') \mid l' \in \mathrm{dom}(\sigma(l)),\ \sigma^{(h)} \uparrow^{K}_{\tau} l' = x\}$$

The translation steps for the basic types and pairing are
straightforward. For collection types, we need to construct a
$\mathcal{K}$-collection corresponding to $l$; to do so, given an
input $x$ we sum together the values $h(l)(l')$ for each label $l'$ in
$\mathrm{dom}(\sigma(l))$ such that the $K$-value of $l'$ in
$\sigma^{(h)}$ is $x$. In particular, note that we ignore the
multiplicity of $l'$ in $\sigma(l)$ here.

We can now show the equivalence of the operational and deno- 
tational presentations of the semiring semantics: 

Theorem 6. 

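1. Suppose $\Gamma \vdash e : \tau$ and $\Psi \vdash \sigma, \gamma : \Gamma$.
Then $\sigma^{(h)}, l \Leftarrow e \Downarrow_K \sigma'^{(h')}$ if and
only if
$\mathcal{K}[\![e]\!](\sigma^{(h)} \uparrow^{K}_{\Gamma} \gamma) = \sigma'^{(h')} \uparrow^{K}_{\tau} l$.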

2. Suppose r,x : r h e : {r'} and \E' h cr, 7 : F. Then 
o^''\x e L,e J|k ct'C^'^L' if and only if {Klel-^lx ■- v] \ 

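Our main result is that extraction semantics is correct with
respect to the operational semantics:

Theorem 7.

1. If $\sigma, l \Leftarrow e \Downarrow \sigma', T$ then
$\sigma^{(h)}, l \Leftarrow e \Downarrow_K \sigma'^{(h')}$ holds if and
only if $h, T \rightsquigarrow_K h'$.

2. If $\sigma, x \in L, e \Downarrow^* \sigma', L', \Theta$ then
$\sigma^{(h)}, x \in L^{(k)}, e \Downarrow^*_K \sigma'^{(h')}, L'^{(k')}$
if and only if $h, k, \Theta \rightsquigarrow^*_K h', k'$.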
Figure 21. Semiring provenance extraction examples



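$$h,\ l \leftarrow t \rightsquigarrow_K h[l := \mathrm{semiring}(t, h)]
\qquad
\frac{h, T_1 \rightsquigarrow_K h' \quad h', T_2 \rightsquigarrow_K h''}{h,\ T_1; T_2 \rightsquigarrow_K h''}$$

$$h,\ l \leftarrow \mathsf{proj}_i(l', l_i) \rightsquigarrow_K h[l := h(l_i)]
\qquad
\frac{h, T \rightsquigarrow_K h'}{h,\ \mathsf{cond}_l(l', b, T) \rightsquigarrow_K h'}$$

$$\frac{h, h(l'), \Theta \rightsquigarrow^*_K h', k'}{h,\ l \leftarrow \mathsf{comp}(l', \Theta) \rightsquigarrow_K h'[l := k' \bullet_{\mathcal{K}} h']}$$

$$h, k, \emptyset \rightsquigarrow^*_K h, 0_{\mathcal{K}}
\qquad
\frac{h, k, \Theta_1 \rightsquigarrow^*_K h_1, k_1 \quad h, k, \Theta_2 \rightsquigarrow^*_K h_2, k_2}{h, k,\ \Theta_1 \oplus \Theta_2 \rightsquigarrow^*_K h_1 \uplus_h h_2,\ k_1 +_{\mathcal{K}} k_2}$$

$$\frac{h, T \rightsquigarrow_K h'}{h, k,\ \{[l]T : m\} \rightsquigarrow^*_K h',\ k(l) \cdot_K \eta_{\mathcal{K}}(\mathrm{out}(T))}$$

Figure 22. Extracting semiring provenance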



Example 6 Figure 21 shows the result of semiring-provenance
extraction on $Q_1$. Here, we write $R_1, S_1$, etc. for the
annotations of $r_1$ in $r$, $s_1$ in $s$, etc. respectively. The
second query $Q_2$ involves $\sum$ expressions, which are not handled
by the semiring model. Instead, the second part of Figure 21 shows
the result of semiring provenance extraction on
$Q_3 = \{\langle A : x.A, D : x.D \rangle \mid x \in Q_1\}$, where we
have merged the two copies of the record
$\langle A : 1, D : 7 \rangle$ together and added their $K$-values.



5. Adaptation 

5.1 Adaptive semantics 

We also introduce an adaptive semantics that adapts traces to
changes in the input. Similarly to change-propagation in AFL
(Acar et al. 2006), we can use the adaptive semantics to "recompute"
an ex-
pression when the input is changed, and to adapt the trace to be 
consistent with the new input and output. However, unlike in AFL, 
our goal here is not to efficiently recompute results, but rather to 
characterize how traces "represent" or "explain" computations. We 
believe efficient techniques for recomputing database queries could 
also be developed using similar ideas, but view this as beyond the 
scope of this paper. 

We define the adaptive semantics rules in Figure 23. Following
the familiar pattern established by the operational semantics, we
use two judgments: $\sigma, T \curvearrowright \sigma', T'$, or
"Recomputing $T$ on $\sigma$ yields result $\sigma'$ and new trace
$T'$", and
$\sigma, x \in L, e, \Theta \curvearrowright^* \sigma', L', \Theta'$,
or "Reiterating $e$ on $\sigma$ for each $x \in L$ with cached traces
$\Theta$ yields result $\sigma'$, result labels $L'$, and new trace
$\Theta'$".

Many of the basic trace steps have straightforward adaptation 
rules. For example, the rule for traces $l \leftarrow t$ simply recomputes the
result using the values of the input labels in the current store. For 
projection, we recompute the operation and discard the cached la- 
bels. Adaptation for sequential composition is also straightforward. 
For conditional traces, there are two rules. If the boolean value of 
the label is the same as that recorded in the trace, then we proceed 
by re-using the subtrace. Otherwise, we need to fall back on the 
trace semantics to compute the other branch. 
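
The assignment, sequencing and conditional cases just described can
be sketched as follows. This Python fragment is a hedged illustration
only: the tuple encoding of expressions and traces and the names
`op`, `run` and `adapt` are hypothetical, `run` is a tiny stand-in
for the traced evaluation semantics that covers only constants,
label copies, additions and conditionals, and projections and
comprehensions are omitted.

```python
def op(term, store):
    """Value of a basic term in the given store."""
    kind = term[0]
    if kind == "const":
        return term[1]
    if kind == "label":
        return store[term[1]]
    if kind == "add":
        return store[term[1]] + store[term[2]]
    raise ValueError(f"unknown term: {kind}")

def run(store, l, e):
    """Traced evaluation of e into label l: returns (new store, trace)."""
    if e[0] == "if":
        _, test, e_t, e_f = e
        b = store[test]
        store2, body = run(store, l, e_t if b else e_f)
        # The trace records the test value, the subtrace and both branches.
        return store2, ("cond", l, test, b, body, e_t, e_f)
    return {**store, l: op(e, store)}, ("assign", l, e)

def adapt(store, trace):
    """Adapt a trace to a (possibly changed) store: returns (store', trace')."""
    kind = trace[0]
    if kind == "assign":            # recompute from the current store
        _, l, term = trace
        return {**store, l: op(term, store)}, trace
    if kind == "seq":               # adapt both parts in order
        s1, t1 = adapt(store, trace[1])
        s2, t2 = adapt(s1, trace[2])
        return s2, ("seq", t1, t2)
    if kind == "cond":
        _, l, test, b, body, e_t, e_f = trace
        b2 = store[test]
        if b2 == b:                 # same test value: re-use the subtrace
            s, body2 = adapt(store, body)
        else:                       # test flipped: evaluate the other branch
            s, body2 = run(store, l, e_t if b2 else e_f)
        return s, ("cond", l, test, b2, body2, e_t, e_f)
    raise ValueError(f"unknown trace form: {kind}")

e = ("if", "c", ("add", "a", "b"), ("label", "b"))
store1, trace1 = run({"a": 1, "b": 2, "c": True}, "l", e)
store2, trace2 = adapt({"a": 1, "b": 2, "c": False}, trace1)
```

In the usage lines, flipping `c` forces the fallback case: `trace2`
records the new test value and a fresh subtrace for the other branch,
which is why conditional traces must carry both branch expressions.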



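$$\sigma,\ l \leftarrow t \curvearrowright \sigma[l := \mathrm{op}(t, \sigma)],\ l \leftarrow t
\qquad
\frac{\sigma, T_1 \curvearrowright \sigma', T_1' \quad \sigma', T_2 \curvearrowright \sigma'', T_2'}{\sigma,\ T_1; T_2 \curvearrowright \sigma'',\ T_1'; T_2'}$$

$$\frac{\sigma(l') = (l_1', l_2')}{\sigma,\ l \leftarrow \mathsf{proj}_i(l', l_i) \curvearrowright \sigma[l := \sigma(l_i')],\ l \leftarrow \mathsf{proj}_i(l', l_i')}$$

$$\frac{b' = \sigma(l') \neq b \quad \sigma, l \Leftarrow e_{b'} \Downarrow \sigma', T'}{\sigma,\ \mathsf{cond}_l(l', b, T)_{e_t,e_f} \curvearrowright \sigma',\ \mathsf{cond}_l(l', b', T')_{e_t,e_f}}$$

$$\frac{\sigma(l') = b \quad \sigma, T \curvearrowright \sigma', T' \quad l = \mathrm{out}(T')}{\sigma,\ \mathsf{cond}_l(l', b, T)_{e_t,e_f} \curvearrowright \sigma',\ \mathsf{cond}_l(l', b, T')_{e_t,e_f}}$$

$$\frac{\sigma, x \in \sigma(l'), e, \Theta \curvearrowright^* \sigma', L', \Theta'}{\sigma,\ l \leftarrow \mathsf{comp}(l', \Theta)_{x.e} \curvearrowright \sigma'[l := \bigcup \sigma'[L']],\ l \leftarrow \mathsf{comp}(l', \Theta')_{x.e}}$$

$$\frac{\sigma, x \in \sigma(l'), e, \Theta \curvearrowright^* \sigma', L', \Theta'}{\sigma,\ l \leftarrow \mathsf{sum}(l', \Theta)_{x.e} \curvearrowright \sigma'[l := \sum \sigma'[L']],\ l \leftarrow \mathsf{sum}(l', \Theta')_{x.e}}$$

$$\sigma, x \in \emptyset, e, \Theta \curvearrowright^* \sigma, \emptyset, \emptyset
\qquad
\frac{[l]T \in \Theta \quad \sigma, T \curvearrowright \sigma', T'}{\sigma, x \in \{l : m\}, e, \Theta \curvearrowright^* \sigma', \{\mathrm{out}(T') : m\}, \{[l]T' : m\}}$$

$$\frac{l \notin \mathrm{in}^*(\Theta) \quad l' \text{ fresh} \quad \sigma, l' \Leftarrow e[l/x] \Downarrow \sigma', T'}{\sigma, x \in \{l : m\}, e, \Theta \curvearrowright^* \sigma', \{l' : m\}, \{[l]T' : m\}}$$

$$\frac{\sigma, x \in L_1, e, \Theta \curvearrowright^* \sigma_1, L_1', \Theta_1 \quad \sigma, x \in L_2, e, \Theta \curvearrowright^* \sigma_2, L_2', \Theta_2}{\sigma, x \in L_1 \oplus L_2, e, \Theta \curvearrowright^* \sigma_1 \uplus_\sigma \sigma_2,\ L_1' \oplus L_2',\ \Theta_1 \oplus \Theta_2}$$

Figure 23. Trace adaptation semantics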



The rules for comprehension and summation traces make use
of the iteration adaptation judgment. In each case, we traverse the
current store value of $l_0$. For each label $l$ in this set, we
re-compute the body of the comprehension, re-using a trace $[l]T$ if
present in $\Theta$, otherwise evaluating $e[l/x]$ in the traced
semantics. The iterative judgments return a new labeled trace set
$\Theta'$ and its return labels $L'$. Note that trace adaptation
ignores the multiplicity of cached traces. When we re-use a cached
trace $[l]T$ on a label $l$ with multiplicity $m$, we simply rerun the
trace and use $m$ as the multiplicity of the result label and new
trace.
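
The iteration step can be pictured with the following Python sketch.
It is deliberately simplified and rests on several assumptions: body
traces are single assignments of the form `("assign", target, term)`,
the only basic operation is addition of two labels, the cache is a
dictionary from element labels to traces, and the names
`adapt_iteration`, `body` and `fresh` are hypothetical.

```python
import itertools

def op(term, store):
    """Evaluate a basic term; only addition of two labels is sketched here."""
    kind, a, b = term
    if kind != "add":
        raise ValueError(f"unknown term: {kind}")
    return store[a] + store[b]

def adapt_iteration(store, labels, body, cache, fresh):
    """Re-iterate a comprehension body over the current element labels.

    labels: dict mapping each element label to its multiplicity
    body:   function giving the body term e[l/x] for an element label l
    cache:  dict mapping element labels to cached body traces
    fresh:  iterator producing unused result labels
    """
    new_store, out_labels, new_cache = dict(store), {}, {}
    for l, m in labels.items():
        if l in cache:                       # re-use: rerun the cached trace
            _, target, term = cache[l]
        else:                                # no cached trace: evaluate e[l/x]
            target, term = next(fresh), body(l)
        new_store[target] = op(term, store)
        out_labels[target] = m               # multiplicity comes from the input
        new_cache[l] = ("assign", target, term)
    return new_store, out_labels, new_cache

store = {"a": 10, "c": 3, "one": 1}
cache = {"a": ("assign", "ya", ("add", "a", "one")),
         "b": ("assign", "yb", ("add", "b", "one"))}
fresh = (f"y{i}" for i in itertools.count())
new_store, out_labels, new_cache = adapt_iteration(
    store, {"a": 2, "c": 1}, lambda x: ("add", x, "one"), cache, fresh)
```

In this run the element `a` re-uses its cached trace, the new element
`c` is evaluated at a fresh label, and the stale trace for the removed
element `b` is dropped from the new trace set.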

Example 7 TODO 

5.2 Metatheory of adaptation 

We now investigate the metatheoretic properties of the traced eval- 
uation and trace adaptation semantics. 

We first show that the traced semantics correctly implements 
the operational semantics of NRC expressions, if we ignore traces. 
This is a straightforward induction in both directions. 

Theorem 8. For any $\sigma, l, e, \sigma'$, we have
$\sigma, l \Leftarrow e \Downarrow \sigma'$ if and only if
$\sigma, l \Leftarrow e \Downarrow \sigma', T$ for some $T$.

We now turn to the correctness of the trace semantics. We
can view the trace semantics as both evaluating $e$ in a store
$\sigma$ yielding $\sigma'$ and translating $e$ to a trace $T$ which
"explains" the execution of $e$. What properties should a trace have
in order to be a valid explanation? We identify two such properties
which help to formalize this intuition. They are called consistency
and fidelity.

Consistency The trace is meant to be an explanation of what
happened when $e$ was evaluated on $\sigma$. For example, if the trace
says that $l \leftarrow l_1 + l_2$ but
$\sigma'(l) \neq \sigma'(l_1) + \sigma'(l_2)$ then this is
inconsistent with the real execution. Also, if the trace contains
$\mathsf{cond}_l(l', f, T)_{e_t,e_f}$, but $l'$ actually evaluated to
$t$ in the evaluation of $e$, then the trace is inconsistent with the
actual execution. As a third example, if the trace contains
$l' \leftarrow \mathsf{comp}(l, \{[l_1]T_1, [l_2]T_2\})_{x.e}$
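$$\frac{\sigma(l) = \mathrm{op}(t, \sigma)}{\sigma \models l \leftarrow t}
\qquad
\frac{\sigma(l') = (l_1, l_2) \quad \sigma(l) = \sigma(l_i)}{\sigma \models l \leftarrow \mathsf{proj}_i(l', l_i)}$$

$$\frac{\sigma \models T_1 \quad \sigma \models T_2}{\sigma \models T_1; T_2}
\qquad
\frac{\sigma(l') = b \quad \sigma \models T \quad \mathrm{out}(T) = l}{\sigma \models \mathsf{cond}_l(l', b, T)_{e_t,e_f}}$$

$$\frac{\sigma(l') = \mathrm{in}^*(\Theta) \quad \sigma \models^* \Theta \quad \sigma(l) = \bigcup \sigma[\mathrm{out}^*(\Theta)]}{\sigma \models l \leftarrow \mathsf{comp}(l', \Theta)_{x.e}}$$

$$\frac{\sigma(l') = \mathrm{in}^*(\Theta) \quad \sigma \models^* \Theta \quad \sigma(l) = \sum \sigma[\mathrm{out}^*(\Theta)]}{\sigma \models l \leftarrow \mathsf{sum}(l', \Theta)_{x.e}}$$

$$\sigma \models^* \emptyset
\qquad
\frac{\sigma \models^* \Theta_1 \quad \sigma \models^* \Theta_2}{\sigma \models^* \Theta_1 \oplus \Theta_2}
\qquad
\frac{\sigma \models T}{\sigma \models^* \{[l]T : m\}}$$

Figure 24. Declarative semantics of traces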

whereas $\sigma(l) = \{l_2, l_3\}$ then the trace is inconsistent
because it does not correctly show the behavior of the comprehension
over $l$.

To formalize this notion of consistency, observe that we can
view a trace declaratively as a collection of statements about the
values in the store. We define a judgment $\sigma \models T$, meaning
"$T$ is satisfied in store $\sigma$". We also employ an auxiliary
judgment $\sigma \models^* \Theta$, meaning "Each trace in $\Theta$ is
satisfied in store $\sigma$". The satisfiability relation is defined
in Figure 24.
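
Read declaratively, a trace can be checked against a store by a
simple recursive function. The Python sketch below covers only
assignment, projection, sequencing and conditional traces; the tuple
encoding and the names `op`, `out` and `satisfies` are hypothetical,
and pairs are assumed to be stored as tuples of labels.

```python
def op(term, store):
    """Value of a basic term in the given store."""
    kind = term[0]
    if kind == "const":
        return term[1]
    if kind == "label":
        return store[term[1]]
    if kind == "add":
        return store[term[1]] + store[term[2]]
    raise ValueError(f"unknown term: {kind}")

def out(trace):
    """Output label of a trace; for T1;T2 it is the output of T2."""
    return out(trace[2]) if trace[0] == "seq" else trace[1]

def satisfies(store, trace):
    """Check whether every statement made by the trace holds in the store."""
    kind = trace[0]
    if kind == "assign":                    # store(l) = op(t, store)
        _, l, term = trace
        return store[l] == op(term, store)
    if kind == "proj":                      # l' holds a pair containing l_i
        _, l, pair, field = trace
        return field in store[pair] and store[l] == store[field]
    if kind == "seq":
        return satisfies(store, trace[1]) and satisfies(store, trace[2])
    if kind == "cond":                      # recorded test value must match
        _, l, test, b, body = trace
        return store[test] == b and satisfies(store, body) and out(body) == l
    raise ValueError(f"unknown trace form: {kind}")

store = {"l1": 2, "l2": 3, "l": 5, "p": ("l1", "l2"), "q": 2}
ok = satisfies(store, ("seq", ("assign", "l", ("add", "l1", "l2")),
                              ("proj", "q", "p", "l1")))
bad = satisfies({**store, "l": 6}, ("assign", "l", ("add", "l1", "l2")))
```

Here `ok` holds because both statements agree with the store, while
`bad` fails because the store no longer contains the sum that the
trace claims for `l`.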

Theorem 9 (Consistency). If $\sigma, l \Leftarrow e \Downarrow \sigma', T$ then $\sigma' \models T$.

Fidelity Consistency is a necessary, but not sufficient,
requirement for traces to be "explanations". It tells us that the
trace records valid information about the results of an execution.
However, this is not enough, in itself, to say that the trace really
"explains" the execution, because a consistent trace might not tell
us what might have happened in other possible executions. To see why,
consider a simple expression
$\mathsf{if}\ l_y\ \mathsf{then}\ l_x + l_z\ \mathsf{else}\ l_z$ run
against input store $[l_x = 42, l_y = t, l_z = 5]$. Consider the
traces $T_1 = l \leftarrow l_x + l_z$ and $T_2 = l \leftarrow 47$.
Both of these traces are consistent, but neither really "explains"
what actually happened. Saying that $l \leftarrow l_x + l_z$ or
$l \leftarrow 47$ is enough to know what the result value was in the
actual run, but not what the result would have been under all
conditions. The dependence on $l_x$ is lost in $T_2$. If we rerun
$T_1$ with a different input store $l_x = 37$, then $T_1$ will
correctly return 42 while $T_2$ will still return 47. Moreover, the
dependences on $l_y$ are lost in both: changing $l_y$ to $f$
invalidates both traces. Instead, the trace
$T_3 = \mathsf{cond}_l(l_y, t, l \leftarrow l_x + l_z)_{l_x + l_z,\, l_z}$
records enough information to recompute the result under any
(reasonable) change to the input store.

We call traces faithful to $e$ if they record enough information
to recompute $e$ when the input store changes. We first consider a
property called partial fidelity. Partial fidelity tells us that the
trace adaptation semantics is partially correct with respect to the
traced evaluation semantics. That is, if $T$ was obtained by running
$e$ on $\sigma_1$ and we can successfully adapt $T$ to a new input
$\sigma_2$ to obtain $\sigma_2'$ and $T'$, then we know that
$\sigma_2'$ and $T'$ could also have been obtained by traced
evaluation from $\sigma_2$ "from scratch".

We first need some lemmas: 

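Lemma 2. If $[l]T \in \Theta$ and
$\sigma, x \in L, e \Downarrow^* \sigma', L', \Theta$ then for some
$\sigma''$ we have
$\sigma, \mathrm{out}(T) \Leftarrow e[l/x] \Downarrow \sigma'', T$.

Proof. Induction on the structure of the derivation of
$\sigma, x \in L, e \Downarrow^* \sigma', L', \Theta$.

• The case where $\Theta = \emptyset$ is vacuous since
$[l]T \in \Theta$.

• Suppose the derivation is of the form

$$\frac{\sigma, x \in L_1, e \Downarrow^* \sigma_1, L_1', \Theta_1 \quad \sigma, x \in L_2, e \Downarrow^* \sigma_2, L_2', \Theta_2}{\sigma, x \in L_1 \oplus L_2, e \Downarrow^* \sigma_1 \uplus_\sigma \sigma_2,\ L_1' \oplus L_2',\ \Theta_1 \oplus \Theta_2}$$

Then either $[l]T \in \Theta_1$ or $[l]T \in \Theta_2$; the cases are
symmetric. In either case, the induction hypothesis applies and we
have $\sigma, \mathrm{out}(T) \Leftarrow e[l/x] \Downarrow \sigma_1, T$
as desired.

• Suppose the derivation is of the form

$$\frac{\sigma, l' \Leftarrow e[l/x] \Downarrow \sigma', T}{\sigma, x \in \{l : m\}, e \Downarrow^* \sigma', \{l' : m\}, \{[l]T : m\}}$$

Then the subderivation
$\sigma, l' \Leftarrow e[l/x] \Downarrow \sigma', T$ is the desired
conclusion.

□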

Lemma 3. If [l]T G O and ^ h r > > r' then we have 
^,1:t h r>out(T) : r'. 

Proof. Straightforward induction similar to Lemma 2. □

Lemma 4. If $\sigma, T \curvearrowright \sigma', T'$ then $\mathrm{out}(T) = \mathrm{out}(T')$.

Proof. Straightforward induction on derivations. □ 

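Theorem 10 (Partial fidelity). Let
$\sigma_1, \sigma_1', \sigma_2, \sigma_2', T, T', \Theta, \Theta'$ be
given.

1. If $\sigma_1, l \Leftarrow e \Downarrow \sigma_1', T$ and
$\sigma_2, T \curvearrowright \sigma_2', T'$ then
$\sigma_2, l \Leftarrow e \Downarrow \sigma_2', T'$.

2. If $\sigma_1, x \in L_1, e \Downarrow^* \sigma_1', L_1', \Theta$ and
$\sigma_2, x \in L_2, e, \Theta \curvearrowright^* \sigma_2', L_2', \Theta'$
then $\sigma_2, x \in L_2, e \Downarrow^* \sigma_2', L_2', \Theta'$.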

Proof. Induction on the structure of the second derivation, with
inversion on the first derivation. Lemma 2 is needed in part (2) to
deal with the adaptation case where $[l]T \in \Theta$ holds.
For part 1, the cases are as follows:

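• If the second derivation is of the form

$$\sigma_2, l \leftarrow t \curvearrowright \sigma_2[l := \mathrm{op}(t, \sigma_2)], l \leftarrow t$$

then the first must be of the form

$$\sigma_1, l \Leftarrow t \Downarrow \sigma_1[l := \mathrm{op}(t, \sigma_1)], l \leftarrow t$$

and so we can immediately conclude

$$\sigma_2, l \Leftarrow t \Downarrow \sigma_2[l := \mathrm{op}(t, \sigma_2)], l \leftarrow t$$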

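• If the second derivation is of the form

$$\frac{\sigma_2(l') = (l_1', l_2')}{\sigma_2, l \leftarrow \mathrm{proj}_i(l', l_i) \curvearrowright \sigma_2[l := \sigma_2(l_i')], l \leftarrow \mathrm{proj}_i(l', l_i')}$$

then the first derivation is of the form

$$\frac{\sigma_1(l') = (l_1, l_2)}{\sigma_1, l \Leftarrow \pi_i(l') \Downarrow \sigma_1[l := \sigma_1(l_i)], l \leftarrow \mathrm{proj}_i(l', l_i)}$$

and so we can immediately conclude

$$\frac{\sigma_2(l') = (l_1', l_2')}{\sigma_2, l \Leftarrow \pi_i(l') \Downarrow \sigma_2[l := \sigma_2(l_i')], l \leftarrow \mathrm{proj}_i(l', l_i')}$$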

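• If the second derivation is of the form

$$\frac{\sigma_2, T_{11} \curvearrowright \sigma_2', T_{21} \qquad \sigma_2', T_{12} \curvearrowright \sigma_2'', T_{22}}{\sigma_2, T_{11}; T_{12} \curvearrowright \sigma_2'', T_{21}; T_{22}}$$

then the first derivation must be of the form

$$\frac{\sigma_1, l' \Leftarrow e_1 \Downarrow \sigma_1', T_{11} \qquad \sigma_1', l \Leftarrow e_2[l'/x] \Downarrow \sigma_1'', T_{12}}{\sigma_1, l \Leftarrow \mathsf{let}\ x = e_1\ \mathsf{in}\ e_2 \Downarrow \sigma_1'', T_{11}; T_{12}}$$

Then by induction we have $\sigma_2, l' \Leftarrow e_1 \Downarrow \sigma_2', T_{21}$ and
$\sigma_2', l \Leftarrow e_2[l'/x] \Downarrow \sigma_2'', T_{22}$, so we can conclude

$$\frac{\sigma_2, l' \Leftarrow e_1 \Downarrow \sigma_2', T_{21} \qquad \sigma_2', l \Leftarrow e_2[l'/x] \Downarrow \sigma_2'', T_{22}}{\sigma_2, l \Leftarrow \mathsf{let}\ x = e_1\ \mathsf{in}\ e_2 \Downarrow \sigma_2'', T_{21}; T_{22}}$$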

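• If the second derivation is of the form

$$\frac{\sigma_2(l') = b \qquad \sigma_2, T_1 \curvearrowright \sigma_2', T_2}{\sigma_2, \mathrm{cond}_l(l', b, T_1)^{e_t}_{e_f} \curvearrowright \sigma_2', \mathrm{cond}_l(l', b, T_2)^{e_t}_{e_f}}$$

then the first derivation must be of the form

$$\frac{\sigma_1(l') = b \qquad \sigma_1, l \Leftarrow e_b \Downarrow \sigma_1', T_1}{\sigma_1, l \Leftarrow \mathsf{if}\ l'\ \mathsf{then}\ e_t\ \mathsf{else}\ e_f \Downarrow \sigma_1', \mathrm{cond}_l(l', b, T_1)^{e_t}_{e_f}}$$

We proceed by induction, obtaining $\sigma_2, l \Leftarrow e_b \Downarrow \sigma_2', T_2$ and
concluding

$$\frac{\sigma_2(l') = b \qquad \sigma_2, l \Leftarrow e_b \Downarrow \sigma_2', T_2}{\sigma_2, l \Leftarrow \mathsf{if}\ l'\ \mathsf{then}\ e_t\ \mathsf{else}\ e_f \Downarrow \sigma_2', \mathrm{cond}_l(l', b, T_2)^{e_t}_{e_f}}$$
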
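• If the second derivation is of the form:

$$\frac{b \neq \sigma_2(l') = b' \qquad \sigma_2, l \Leftarrow e_{b'} \Downarrow \sigma_2', T_2}{\sigma_2, \mathrm{cond}_l(l', b, T_1)^{e_t}_{e_f} \curvearrowright \sigma_2', \mathrm{cond}_l(l', b', T_2)^{e_t}_{e_f}}$$

then again the first derivation must be of the form

$$\frac{\sigma_1(l') = b \qquad \sigma_1, l \Leftarrow e_b \Downarrow \sigma_1', T_1}{\sigma_1, l \Leftarrow \mathsf{if}\ l'\ \mathsf{then}\ e_t\ \mathsf{else}\ e_f \Downarrow \sigma_1', \mathrm{cond}_l(l', b, T_1)^{e_t}_{e_f}}$$

and we may immediately conclude:

$$\frac{\sigma_2(l') = b' \qquad \sigma_2, l \Leftarrow e_{b'} \Downarrow \sigma_2', T_2}{\sigma_2, l \Leftarrow \mathsf{if}\ l'\ \mathsf{then}\ e_t\ \mathsf{else}\ e_f \Downarrow \sigma_2', \mathrm{cond}_l(l', b', T_2)^{e_t}_{e_f}}$$
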
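• If the second derivation is of the form

$$\frac{\sigma_2, x \in \sigma_2(l'), e, \Theta_1 \curvearrowright^* \sigma_2', L_2, \Theta_2}{\sigma_2, l \leftarrow \mathrm{comp}(l', \Theta_1)\,x.e \curvearrowright \sigma_2'[l := \bigcup \sigma_2'[L_2]], l \leftarrow \mathrm{comp}(l', \Theta_2)\,x.e}$$

then the first derivation must be of the form

$$\frac{\sigma_1, x \in \sigma_1(l'), e \Downarrow^* \sigma_1', L_1, \Theta_1}{\sigma_1, l \Leftarrow \bigcup\{e \mid x \in l'\} \Downarrow \sigma_1'[l := \bigcup \sigma_1'[L_1]], l \leftarrow \mathrm{comp}(l', \Theta_1)\,x.e}$$

By induction hypothesis (2), we have that $\sigma_2, x \in \sigma_2(l'), e \Downarrow^* \sigma_2', L_2, \Theta_2$
holds, so we can conclude:

$$\frac{\sigma_2, x \in \sigma_2(l'), e \Downarrow^* \sigma_2', L_2, \Theta_2}{\sigma_2, l \Leftarrow \bigcup\{e \mid x \in l'\} \Downarrow \sigma_2'[l := \bigcup \sigma_2'[L_2]], l \leftarrow \mathrm{comp}(l', \Theta_2)\,x.e}$$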

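• If the second derivation is of the form

$$\frac{\sigma_2, x \in \sigma_2(l'), e, \Theta_1 \curvearrowright^* \sigma_2', L_2, \Theta_2}{\sigma_2, l \leftarrow \mathrm{sum}(l', \Theta_1)\,x.e \curvearrowright \sigma_2'[l := \sum \sigma_2'[L_2]], l \leftarrow \mathrm{sum}(l', \Theta_2)\,x.e}$$

the reasoning is similar to the previous case.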

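For part (2), the proof is by induction on the second derivation:

• If the derivation is of the form:

$$\sigma_2, x \in \emptyset, e, \Theta_1 \curvearrowright^* \sigma_2, \emptyset, \emptyset$$

then we can immediately conclude

$$\sigma_2, x \in \emptyset, e \Downarrow^* \sigma_2, \emptyset, \emptyset$$
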
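• If the derivation is of the form:

$$\frac{\sigma_2, x \in L_{21}, e, \Theta_1 \curvearrowright^* \sigma_{21}, L_{21}', \Theta_{21} \qquad \sigma_2, x \in L_{22}, e, \Theta_1 \curvearrowright^* \sigma_{22}, L_{22}', \Theta_{22}}{\sigma_2, x \in L_{21} \oplus L_{22}, e, \Theta_1 \curvearrowright^* \sigma_{21} \uplus_{\sigma_2} \sigma_{22}, L_{21}' \oplus L_{22}', \Theta_{21} \oplus \Theta_{22}}$$

then we proceed by induction, concluding:

$$\frac{\sigma_2, x \in L_{21}, e \Downarrow^* \sigma_{21}, L_{21}', \Theta_{21} \qquad \sigma_2, x \in L_{22}, e \Downarrow^* \sigma_{22}, L_{22}', \Theta_{22}}{\sigma_2, x \in L_{21} \oplus L_{22}, e \Downarrow^* \sigma_{21} \uplus_{\sigma_2} \sigma_{22}, L_{21}' \oplus L_{22}', \Theta_{21} \oplus \Theta_{22}}$$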

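• If the derivation is of the form

$$\frac{l \notin \mathrm{in}^*(\Theta_1) \qquad l' \text{ fresh} \qquad \sigma_2, l' \Leftarrow e[l/x] \Downarrow \sigma_2', T_2}{\sigma_2, x \in \{l : m\}, e, \Theta_1 \curvearrowright^* \sigma_2', \{l' : m\}, \{[l]T_2 : m\}}$$

then we can immediately conclude:

$$\frac{\sigma_2, l' \Leftarrow e[l/x] \Downarrow \sigma_2', T_2 \qquad l' \text{ fresh}}{\sigma_2, x \in \{l : m\}, e \Downarrow^* \sigma_2', \{l' : m\}, \{[l]T_2 : m\}}$$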

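• If the derivation is of the form:

$$\frac{[l]T_1 \in \Theta_1 \qquad \sigma_2, T_1 \curvearrowright \sigma_2', T_2}{\sigma_2, x \in \{l : m\}, e, \Theta_1 \curvearrowright^* \sigma_2', \{\mathrm{out}(T_2) : m\}, \{[l]T_2 : m\}}$$

then observe that $\mathrm{out}(T_1) = \mathrm{out}(T_2)$ by Lemma 4. Moreover,
by Lemma 2 we have $\sigma_1, \mathrm{out}(T_1) \Leftarrow e[l/x] \Downarrow \sigma'', T_1$, so by
induction we have $\sigma_2, \mathrm{out}(T_1) \Leftarrow e[l/x] \Downarrow \sigma_2', T_2$, and we can
conclude

$$\frac{\sigma_2, \mathrm{out}(T_1) \Leftarrow e[l/x] \Downarrow \sigma_2', T_2}{\sigma_2, x \in \{l : m\}, e \Downarrow^* \sigma_2', \{\mathrm{out}(T_2) : m\}, \{[l]T_2 : m\}}$$

□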

However, partial fidelity is rather weak, since there is no guarantee
that $T$ can be adapted to a given $\sigma_2$. To formalize and prove total
fidelity, we need to be careful about which changed inputs $\sigma_2$ we
consider. Obviously, $\sigma_2$ must be type-compatible with $T$ in some
sense; for instance, we cannot expect a trace such as $l \leftarrow l_1 + l_2$ to
adapt to an input in which $l_1 = \mathsf{true}$. Thus, we need to set up a type
system for stores and traces and prove type soundness for traced
evaluation and adaptation.

More subtly, if we have a trace $l \leftarrow t$ that writes to $l$ and we try
to evaluate it on a different store that already defines $l$, perhaps at a
different type, then the adaptation step may succeed, but the resulting
store may be ill-formed, leading to problems later on. In general,
we need to restrict attention to altered stores $\sigma_2$ that preserve the
types of labels read by $T$ and avoid labels written by $T$.

We say that $\sigma$ matches $\Psi$ avoiding $S$ (written $\sigma <: \Psi \mathrel{\#} S$) if
$\sigma : \Psi'$ for some $\Psi' \supseteq \Psi$ with $\mathrm{dom}(\Psi') \cap S = \emptyset$. That is, $\sigma$ satisfies
the type information in $\Psi$, and may have other labels, but the other
labels cannot overlap with $S$. Moreover, when $L$ is a collection
of labels $\{l_1 : m_1, \ldots, l_n : m_n\}$, we sometimes write $L{:}\tau$ as an
abbreviation for $l_1{:}\tau, \ldots, l_n{:}\tau$; thus, $\sigma <: \Psi, L{:}\tau \mathrel{\#} S$ stands
for $\sigma <: \Psi, l_1{:}\tau, \ldots, l_n{:}\tau \mathrel{\#} S$.
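The matching relation can be illustrated with a small executable sketch. The Python below is only an illustration under stated assumptions: types are represented as plain strings, and the store is summarized by a hypothetical dictionary `store_types` that plays the role of $\Psi'$ (the typing of every label the store defines). The function name `matches_avoiding` is ours and does not appear in the formal development.

```python
def matches_avoiding(store_types, psi, avoid):
    """Check sigma <: Psi # S, given the store's own typing Psi'.

    store_types: dict label -> type, assumed to type exactly the labels
                 defined by the store (our stand-in for Psi').
    psi:         dict label -> type, the required typing Psi.
    avoid:       set of labels S.
    """
    # Psi' must extend Psi: every label of Psi is present at the same type.
    extends = all(store_types.get(label) == ty for label, ty in psi.items())
    # dom(Psi') must be disjoint from S.
    disjoint = not (set(store_types) & set(avoid))
    return extends and disjoint


store_types = {"l1": "int", "l2": "bool", "x": "int"}
psi = {"l1": "int", "l2": "bool"}
```

With these values, `matches_avoiding(store_types, psi, {"y"})` holds, while avoiding `{"x"}` fails because the store already defines `x`.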

We also need to be careful to avoid making the type system
too specific about the labels used internally by $T$, because these
may change when $T$ is adapted. We therefore introduce a typing
judgment for traces $\Psi \vdash T \triangleright l : \tau$, meaning "In a store matching
type $\Psi$, trace $T$ produces an output $l$ of type $\tau$." Trace typing does
not expose the types of labels created by $T$ for internal use in the
rules for let and comprehension. The rules are shown in Figure 25,
along with the auxiliary judgment $\Psi \vdash \tau \triangleright \Theta \triangleright \tau'$, meaning "In a
store matching $\Psi$, the labeled traces $\Theta$ operate on inputs of type $\tau$
and produce outputs of type $\tau'$".

We now show that for well-formed expressions and input
stores, traced evaluation can construct well-formed output stores
and traces avoiding any finite set of labels. Here, we need label-avoidance
constraints to avoid label conflicts between $\sigma_1$ and $\sigma_2$
in the $\Downarrow^*$-rule for $\Theta_1 \oplus \Theta_2$. We also need these constraints later
in proving Theorem 13. Next we show traced evaluation is sound,
that is, produces well-formed traces and states.






2008/12/2 



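$$\frac{\Psi \vdash_{\mathrm{term}} t : \tau}{\Psi \vdash l \leftarrow t \triangleright l : \tau} \qquad \frac{\Psi(l') = \tau_1 \times \tau_2}{\Psi \vdash l \leftarrow \mathrm{proj}_i(l', l_i) \triangleright l : \tau_i}$$

$$\frac{\Psi \vdash T_1 \triangleright l' : \tau' \qquad \Psi, l'{:}\tau' \vdash T_2 \triangleright l : \tau}{\Psi \vdash T_1; T_2 \triangleright l : \tau}$$

$$\frac{\Psi(l') = \mathsf{bool} \qquad \Psi \vdash T \triangleright l : \tau \qquad \Psi \vdash e_t : \tau \qquad \Psi \vdash e_f : \tau}{\Psi \vdash \mathrm{cond}_l(l', b, T)^{e_t}_{e_f} \triangleright l : \tau}$$

$$\frac{\Psi(l') = \{\tau'\} \qquad \Psi \vdash \tau' \triangleright \Theta \triangleright \{\tau\} \qquad \Psi, x{:}\tau' \vdash e : \{\tau\}}{\Psi \vdash l \leftarrow \mathrm{comp}(l', \Theta)\,x.e \triangleright l : \{\tau\}}$$

$$\frac{\Psi(l') = \{\tau'\} \qquad \Psi \vdash \tau' \triangleright \Theta \triangleright \mathsf{int} \qquad \Psi, x{:}\tau' \vdash e : \mathsf{int}}{\Psi \vdash l \leftarrow \mathrm{sum}(l', \Theta)\,x.e \triangleright l : \mathsf{int}}$$

$$\frac{}{\Psi \vdash \tau \triangleright \emptyset \triangleright \tau'} \qquad \frac{\Psi, l{:}\tau \vdash T \triangleright l' : \tau'}{\Psi \vdash \tau \triangleright \{[l]T : m\} \triangleright \tau'}$$

$$\frac{\Psi \vdash \tau \triangleright \Theta_1 \triangleright \tau' \qquad \Psi \vdash \tau \triangleright \Theta_2 \triangleright \tau'}{\Psi \vdash \tau \triangleright \Theta_1 \oplus \Theta_2 \triangleright \tau'}$$

Figure 25. Trace well-formedness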



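Theorem 11 (Traceability). Let $S$ be a finite set of labels, and
$\Psi, e, \tau, l, \sigma$ be arbitrary.

1. If $\Psi \vdash e : \tau$ and $\sigma <: \Psi \mathrel{\#} S \cup \{l\}$ then there exist $\sigma', T$ such
that $\sigma, l \Leftarrow e \Downarrow \sigma', T$ and $\sigma' <: \Psi, l{:}\tau \mathrel{\#} S$.

2. If $\Psi, x{:}\tau \vdash e : \tau'$ and $\sigma <: \Psi, L{:}\tau \mathrel{\#} S \cup L'$ then there exist
$\sigma', \Theta$ such that $\sigma, x \in L, e \Downarrow^* \sigma', L', \Theta$ and $\sigma' <: \Psi, L'{:}\tau' \mathrel{\#} S$.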

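Proof. For part (1), proof is by induction on the structure of derivations
of $\Psi \vdash e : \tau$.

• If the expression is a term $t$ then we have

$$\frac{\Psi \vdash_{\mathrm{term}} t : \tau}{\Psi \vdash t : \tau}$$

Hence, $\Psi \vdash_{\mathrm{con}} \mathrm{op}(\sigma, t) : \tau$ so

$$\sigma, l \Leftarrow t \Downarrow \sigma[l := \mathrm{op}(\sigma, t)], l \leftarrow t$$

where $\sigma[l := \mathrm{op}(\sigma, t)] <: \Psi, l{:}\tau \mathrel{\#} S$.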

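• If the derivation is of the form

$$\Psi \vdash \pi_i(l') : \tau_i$$

then we know $\Psi \vdash_{\mathrm{con}} \sigma(l') : \tau_1 \times \tau_2$ so we must have
$\sigma(l') = (l_1, l_2)$. Hence, we can derive

$$\frac{\sigma(l') = (l_1, l_2)}{\sigma, l \Leftarrow \pi_i(l') \Downarrow \sigma[l := \sigma(l_i)], l \leftarrow \mathrm{proj}_i(l', l_i)}$$

where $\sigma[l := \sigma(l_i)] <: \Psi, l{:}\tau_i \mathrel{\#} S$.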

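• If the derivation is of the form

$$\frac{\Psi \vdash e_1 : \tau' \qquad \Psi, x{:}\tau' \vdash e_2 : \tau}{\Psi \vdash \mathsf{let}\ x = e_1\ \mathsf{in}\ e_2 : \tau}$$

then choose a fresh $l' \notin \mathrm{dom}(\sigma) \cup S \cup \{l\}$. By induction we
have $\sigma, l' \Leftarrow e_1 \Downarrow \sigma', T_1$ where $\sigma' <: \Psi, l'{:}\tau' \mathrel{\#} S \cup \{l\}$.
Substituting $l'$ for $x$, we have $\Psi, l'{:}\tau' \vdash e_2[l'/x] : \tau$ so by
induction we also have $\sigma', l \Leftarrow e_2[l'/x] \Downarrow \sigma'', T_2$ where
$\sigma'' <: \Psi, l'{:}\tau', l{:}\tau \mathrel{\#} S$. Finally we can derive

$$\frac{l' \text{ fresh} \qquad \sigma, l' \Leftarrow e_1 \Downarrow \sigma', T_1 \qquad \sigma', l \Leftarrow e_2[l'/x] \Downarrow \sigma'', T_2}{\sigma, l \Leftarrow \mathsf{let}\ x = e_1\ \mathsf{in}\ e_2 \Downarrow \sigma'', T_1; T_2}$$

and $\sigma'' <: \Psi, l{:}\tau \mathrel{\#} S$.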



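• If the derivation is of the form

$$\frac{\Psi(l') = \mathsf{bool} \qquad \Psi \vdash e_t : \tau \qquad \Psi \vdash e_f : \tau}{\Psi \vdash \mathsf{if}\ l'\ \mathsf{then}\ e_t\ \mathsf{else}\ e_f : \tau}$$

then we must have $\sigma(l') = b \in \mathbb{B}$. By induction, we obtain
$\sigma, l \Leftarrow e_b \Downarrow \sigma', T$ where $\sigma' <: \Psi, l{:}\tau \mathrel{\#} S$. Thus, we can
conclude

$$\frac{\sigma(l') = b \qquad \sigma, l \Leftarrow e_b \Downarrow \sigma', T}{\sigma, l \Leftarrow \mathsf{if}\ l'\ \mathsf{then}\ e_t\ \mathsf{else}\ e_f \Downarrow \sigma', \mathrm{cond}_l(l', b, T)^{e_t}_{e_f}}$$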

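• If the derivation is of the form

$$\frac{\Psi(l) = \{\tau'\} \qquad \Psi, x{:}\tau' \vdash e : \{\tau\}}{\Psi \vdash \bigcup\{e \mid x \in l\} : \{\tau\}}$$

then we must have $\sigma(l) = L$ where $\Psi \vdash_{\mathrm{con}} L : \{\tau'\}$. Then
there exist $\sigma', L', \Theta$ such that $\sigma, x \in \sigma(l), e \Downarrow^* \sigma', L', \Theta$ and
$\sigma' <: \Psi, L'{:}\{\tau\} \mathrel{\#} \{l'\} \cup S$. Hence we can conclude

$$\frac{\sigma, x \in \sigma(l), e \Downarrow^* \sigma', L', \Theta}{\sigma, l' \Leftarrow \bigcup\{e \mid x \in l\} \Downarrow \sigma'[l' := \bigcup \sigma'[L']], l' \leftarrow \mathrm{comp}(l, \Theta)\,x.e}$$

and $\sigma'[l' := \bigcup \sigma'[L']] <: \Psi, l'{:}\{\tau\} \mathrel{\#} S$.

• The case for $\sum\{e \mid x \in l\}$ is similar.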

For part (2), the proof is by induction on $L$:

• If $L = \emptyset$ then we can immediately conclude

$$\sigma, x \in \emptyset, e \Downarrow^* \sigma, \emptyset, \emptyset$$

where $\sigma <: \Psi \mathrel{\#} S$.

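• If $L = L_1 \oplus L_2$ then by induction we have $\sigma, x \in L_1, e \Downarrow^* \sigma_1, L_1', \Theta_1$
where $\sigma_1 <: \Psi, L_1'{:}\tau' \mathrel{\#} S$. Moreover, we also have
$\sigma, x \in L_2, e \Downarrow^* \sigma_2, L_2', \Theta_2$ where $\sigma_2 <: \Psi, L_2'{:}\tau' \mathrel{\#} (\mathrm{dom}(\sigma_1) - \mathrm{dom}(\sigma)) \cup S$.
Thus, $\sigma_1 \uplus_{\sigma} \sigma_2$ exists and avoids $S$; hence,

$$\frac{\sigma, x \in L_1, e \Downarrow^* \sigma_1, L_1', \Theta_1 \qquad \sigma, x \in L_2, e \Downarrow^* \sigma_2, L_2', \Theta_2}{\sigma, x \in L_1 \oplus L_2, e \Downarrow^* \sigma_1 \uplus_{\sigma} \sigma_2, L_1' \oplus L_2', \Theta_1 \oplus \Theta_2}$$

and $\sigma_1 \uplus_{\sigma} \sigma_2 <: \Psi, L_1' \oplus L_2'{:}\tau' \mathrel{\#} S$.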

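• If $L = \{l : m\}$ then we can substitute to obtain $\Psi, l{:}\tau \vdash e[l/x] : \tau'$.
Choose $l'$ fresh for $\mathrm{dom}(\sigma) \cup S$ so that we have
$\sigma <: \Psi, l{:}\tau \mathrel{\#} S \cup \{l'\}$. Then by induction we have
$\sigma, l' \Leftarrow e[l/x] \Downarrow \sigma', T$ where $\sigma' <: \Psi, l{:}\tau, l'{:}\tau' \mathrel{\#} S$. Then we can
conclude

$$\frac{l' \text{ fresh} \qquad \sigma, l' \Leftarrow e[l/x] \Downarrow \sigma', T}{\sigma, x \in \{l : m\}, e \Downarrow^* \sigma', \{l' : m\}, \{[l]T : m\}}$$

since $\sigma' <: \Psi, l'{:}\tau' \mathrel{\#} S$.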

□ 

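Theorem 12 (Soundness of traced evaluation). Let $e, \tau, l, \sigma$ be
arbitrary.

1. If $\Psi \vdash e : \tau$ and $\sigma, l \Leftarrow e \Downarrow \sigma', T$ and $\sigma <: \Psi$ then
$\Psi \vdash T \triangleright l : \tau$ and $\sigma' <: \Psi, l{:}\tau$.

2. If $\Psi, x{:}\tau \vdash e : \tau'$ and $\sigma <: \Psi, L{:}\tau$ and $\sigma, x \in L, e \Downarrow^* \sigma', L', \Theta$
then $\Psi \vdash \tau \triangleright \Theta \triangleright \tau'$ and $\sigma' <: \Psi, L'{:}\tau'$.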

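Proof. For part (1), proof is by induction on the second derivation.

• If the derivation is of the form

$$\sigma, l \Leftarrow t \Downarrow \sigma[l := \mathrm{op}(t, \sigma)], l \leftarrow t$$

then by inversion we have that $\Psi \vdash_{\mathrm{term}} t : \tau$ and so we can
derive

$$\frac{\Psi \vdash_{\mathrm{term}} t : \tau}{\Psi \vdash l \leftarrow t \triangleright l : \tau}$$

• If the derivation is of the form

$$\frac{\sigma(l') = (l_1, l_2)}{\sigma, l \Leftarrow \pi_i(l') \Downarrow \sigma[l := \sigma(l_i)], l \leftarrow \mathrm{proj}_i(l', l_i)}$$

then by inversion we have that $\Psi(l') = \tau_1 \times \tau_2$, so we may
conclude:

$$\frac{\Psi(l') = \tau_1 \times \tau_2}{\Psi \vdash l \leftarrow \mathrm{proj}_i(l', l_i) \triangleright l : \tau_i}$$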

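• If the derivation is of the form

$$\frac{\sigma, l' \Leftarrow e_1 \Downarrow \sigma_1, T_1 \qquad \sigma_1, l \Leftarrow e_2[l'/x] \Downarrow \sigma_2, T_2 \qquad l' \text{ fresh}}{\sigma, l \Leftarrow \mathsf{let}\ x = e_1\ \mathsf{in}\ e_2 \Downarrow \sigma_2, T_1; T_2}$$

then we must also have

$$\frac{\Psi \vdash e_1 : \tau' \qquad \Psi, x{:}\tau' \vdash e_2 : \tau}{\Psi \vdash \mathsf{let}\ x = e_1\ \mathsf{in}\ e_2 : \tau}$$

and by induction and substituting $l'$ for $x$ we have $\Psi \vdash T_1 \triangleright l' : \tau'$
and $\Psi, l'{:}\tau' \vdash T_2 \triangleright l : \tau$. So we may conclude

$$\frac{\Psi \vdash T_1 \triangleright l' : \tau' \qquad \Psi, l'{:}\tau' \vdash T_2 \triangleright l : \tau}{\Psi \vdash T_1; T_2 \triangleright l : \tau}$$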

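• If the derivation is of the form:

$$\frac{\sigma(l') = b \qquad \sigma, l \Leftarrow e_b \Downarrow \sigma', T}{\sigma, l \Leftarrow \mathsf{if}\ l'\ \mathsf{then}\ e_t\ \mathsf{else}\ e_f \Downarrow \sigma', \mathrm{cond}_l(l', b, T)^{e_t}_{e_f}}$$

then by inversion we must have

$$\frac{\Psi(l') = \mathsf{bool} \qquad \Psi \vdash e_t : \tau \qquad \Psi \vdash e_f : \tau}{\Psi \vdash \mathsf{if}\ l'\ \mathsf{then}\ e_t\ \mathsf{else}\ e_f : \tau}$$

Hence whatever the value of $b$, by induction we can obtain
$\Psi \vdash T \triangleright l : \tau$. To conclude, we derive:

$$\frac{\Psi(l') = \mathsf{bool} \qquad \Psi \vdash T \triangleright l : \tau \qquad \Psi \vdash e_t : \tau \qquad \Psi \vdash e_f : \tau}{\Psi \vdash \mathrm{cond}_l(l', b, T)^{e_t}_{e_f} \triangleright l : \tau}$$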

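• If the derivation is of the form

$$\frac{\sigma, x \in \sigma(l'), e \Downarrow^* \sigma', L', \Theta}{\sigma, l \Leftarrow \bigcup\{e \mid x \in l'\} \Downarrow \sigma'[l := \bigcup \sigma'[L']], l \leftarrow \mathrm{comp}(l', \Theta)\,x.e}$$

then by inversion we have

$$\Psi \vdash \bigcup\{e \mid x \in l'\} : \{\tau\}$$

Then by induction hypothesis (2) we have that $\Psi \vdash \tau' \triangleright \Theta \triangleright \{\tau\}$,
so we may conclude:

$$\frac{\Psi(l') = \{\tau'\} \qquad \Psi \vdash \tau' \triangleright \Theta \triangleright \{\tau\} \qquad \Psi, x{:}\tau' \vdash e : \{\tau\}}{\Psi \vdash l \leftarrow \mathrm{comp}(l', \Theta)\,x.e \triangleright l : \{\tau\}}$$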

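• For the $\sum$ case,

$$\frac{\sigma, x \in \sigma(l'), e \Downarrow^* \sigma', L', \Theta}{\sigma, l \Leftarrow \sum\{e \mid x \in l'\} \Downarrow \sigma'[l := \sum \sigma'[L']], l \leftarrow \mathrm{sum}(l', \Theta)\,x.e}$$

the reasoning is similar to the previous case.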

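For part (2), proof is by induction on the structure of the third
derivation.

• If the derivation is of the form:

$$\sigma, x \in \emptyset, e \Downarrow^* \sigma, \emptyset, \emptyset$$

then we can immediately derive

$$\Psi \vdash \tau \triangleright \emptyset \triangleright \tau'$$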



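• If the derivation is of the form:

$$\frac{\sigma, l' \Leftarrow e[l/x] \Downarrow \sigma', T}{\sigma, x \in \{l : m\}, e \Downarrow^* \sigma', \{l' : m\}, \{[l]T : m\}}$$

then we may substitute $l$ for $x$ to obtain $\Psi, l{:}\tau \vdash e[l/x] : \tau'$
and so by induction hypothesis (1) we have $\Psi, l{:}\tau \vdash T \triangleright l' : \tau'$.
We may conclude by deriving:

$$\frac{\Psi, l{:}\tau \vdash T \triangleright l' : \tau'}{\Psi \vdash \tau \triangleright \{[l]T : m\} \triangleright \tau'}$$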

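• If the derivation is of the form:

$$\frac{\sigma, x \in L_1, e \Downarrow^* \sigma_1, L_1', \Theta_1 \qquad \sigma, x \in L_2, e \Downarrow^* \sigma_2, L_2', \Theta_2}{\sigma, x \in L_1 \oplus L_2, e \Downarrow^* \sigma_1 \uplus_{\sigma} \sigma_2, L_1' \oplus L_2', \Theta_1 \oplus \Theta_2}$$

then by induction we obtain $\Psi \vdash \tau \triangleright \Theta_1 \triangleright \tau'$ and $\Psi \vdash \tau \triangleright \Theta_2 \triangleright \tau'$,
so we conclude

$$\frac{\Psi \vdash \tau \triangleright \Theta_1 \triangleright \tau' \qquad \Psi \vdash \tau \triangleright \Theta_2 \triangleright \tau'}{\Psi \vdash \tau \triangleright \Theta_1 \oplus \Theta_2 \triangleright \tau'}$$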

□ 

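We define the set of labels written by $T$, or $\mathrm{Wr}(T)$, as follows:

$$\begin{aligned}
\mathrm{Wr}(l \leftarrow t) &= \{l\} \\
\mathrm{Wr}(l \leftarrow \mathrm{proj}_i(l', l_i)) &= \{l\} \\
\mathrm{Wr}(\mathrm{cond}_l(l', b, T)^{e_t}_{e_f}) &= \{l\} \cup \mathrm{Wr}(T) \\
\mathrm{Wr}(T_1; T_2) &= \mathrm{Wr}(T_1) \cup \mathrm{Wr}(T_2) \\
\mathrm{Wr}(l \leftarrow \mathrm{comp}(l', \Theta)\,x.e) &= \{l\} \cup \mathrm{Wr}(\Theta) \\
\mathrm{Wr}(l \leftarrow \mathrm{sum}(l', \Theta)\,x.e) &= \{l\} \cup \mathrm{Wr}(\Theta) \\
\mathrm{Wr}(\Theta) &= \bigcup \{\mathrm{Wr}(T) \mid [l]T : m \in \Theta\}
\end{aligned}$$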
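As an informal illustration, the definition of the written labels can be transcribed almost clause by clause. The Python sketch below is not part of the formal development: the class names (`Assign`, `Proj`, `Seq`, `Cond`, `Comp`) and their fields are hypothetical encodings of the trace forms, multiplicities in labeled trace sets are omitted, and a single `Comp` class with an `is_sum` flag stands for both comprehension and sum traces.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Assign:            # l <- t
    label: str
    term: str


@dataclass(frozen=True)
class Proj:              # l <- proj_i(l', l_i)
    label: str
    index: int
    source: str
    field_label: str


@dataclass(frozen=True)
class Seq:               # T1 ; T2
    first: "Trace"
    second: "Trace"


@dataclass(frozen=True)
class Cond:              # cond_l(l', b, T); branch expressions omitted
    label: str
    test: str
    branch: bool
    body: "Trace"


@dataclass(frozen=True)
class Comp:              # l <- comp(l', Theta) x.e, or sum when is_sum is set
    label: str
    source: str
    entries: Tuple[Tuple[str, "Trace"], ...]   # labeled traces [l]T
    is_sum: bool = False


Trace = Union[Assign, Proj, Seq, Cond, Comp]


def written_theta(entries) -> frozenset:
    """Wr(Theta): union of Wr(T) over all labeled traces [l]T in Theta."""
    result = frozenset()
    for _input_label, trace in entries:
        result |= written(trace)
    return result


def written(trace) -> frozenset:
    """Wr(T): the set of labels written by trace T."""
    if isinstance(trace, (Assign, Proj)):
        return frozenset({trace.label})
    if isinstance(trace, Seq):
        return written(trace.first) | written(trace.second)
    if isinstance(trace, Cond):
        return frozenset({trace.label}) | written(trace.body)
    if isinstance(trace, Comp):
        return frozenset({trace.label}) | written_theta(trace.entries)
    raise TypeError(f"not a trace: {trace!r}")


example = Seq(
    Assign("x", "1"),
    Comp("l", "r", (
        ("r1", Assign("y1", "x + 1")),
        ("r2", Cond("y2", "c", True, Assign("y2", "x"))),
    )),
)
```

For the `example` trace, `written(example)` contains exactly the labels `x`, `l`, `y1` and `y2`; the input labels `r`, `r1`, `r2` and `c` are read but not written.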

Finally, we show that the adaptive semantics always succeeds 
for well-formed traces T and well-formed stores that avoid the 
labels written by T. 

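Theorem 13 (Adaptability). Let $S$ be a finite set of labels, and
$\Psi, T, \tau, l, \sigma$ be arbitrary.

1. If $\Psi \vdash T \triangleright l : \tau$ and $\sigma <: \Psi \mathrel{\#} S \cup \mathrm{Wr}(T)$ then there exist
$\sigma', T'$ such that $\sigma, T \curvearrowright \sigma', T'$ and $\sigma' <: \Psi, l{:}\tau \mathrel{\#} S$.

2. If $\Psi \vdash \tau \triangleright \Theta \triangleright \tau'$ and $\Psi, x{:}\tau \vdash e : \tau'$ and $\sigma <: \Psi, L{:}\tau \mathrel{\#} \mathrm{Wr}(\Theta) \cup S$
then there exist $\sigma', L', \Theta'$ such that $\sigma, x \in L, e, \Theta \curvearrowright^* \sigma', L', \Theta'$
and $\sigma' <: \Psi, L'{:}\tau' \mathrel{\#} S$.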

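Proof. For the first part, proof is by induction on the structure of
the first derivation.

• If the derivation is of the form

$$\frac{\Psi \vdash_{\mathrm{term}} t : \tau}{\Psi \vdash l \leftarrow t \triangleright l : \tau}$$

then we can conclude

$$\sigma, l \leftarrow t \curvearrowright \sigma[l := \mathrm{op}(t, \sigma)], l \leftarrow t$$

since $\sigma$ avoids $\mathrm{Wr}(l \leftarrow t) = \{l\}$. Moreover, $\sigma[l := \mathrm{op}(t, \sigma)] <: \Psi, l{:}\tau \mathrel{\#} S$.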

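• If the derivation is of the form

$$\frac{\Psi(l') = \tau_1 \times \tau_2}{\Psi \vdash l \leftarrow \mathrm{proj}_i(l', l_i) \triangleright l : \tau_i}$$

then $\sigma(l')$ must be a pair $(l_1', l_2')$, and we can conclude

$$\frac{\sigma(l') = (l_1', l_2')}{\sigma, l \leftarrow \mathrm{proj}_i(l', l_i) \curvearrowright \sigma[l := \sigma(l_i')], l \leftarrow \mathrm{proj}_i(l', l_i')}$$

since $\sigma$ avoids $\mathrm{Wr}(l \leftarrow \mathrm{proj}_i(l', l_i)) = \{l\}$. Note that we do
not re-use $l_i$ so the typing judgment does not need to check that
it is of the right type. In fact, $l_i$ need not be in $\Psi$ at all. Finally,
$\sigma' <: \Psi, l{:}\tau_i \mathrel{\#} S$.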

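• If the derivation is of the form

$$\frac{\Psi \vdash T_1 \triangleright l' : \tau' \qquad \Psi, l'{:}\tau' \vdash T_2 \triangleright l : \tau}{\Psi \vdash T_1; T_2 \triangleright l : \tau}$$

then since $l' \in \mathrm{Wr}(T_1)$ and $\sigma <: \Psi \mathrel{\#} \mathrm{Wr}(T_1) \cup (\mathrm{Wr}(T_2) \cup S)$,
by induction we have that $\sigma, T_1 \curvearrowright \sigma', T_1'$ and
$\sigma' <: \Psi, l'{:}\tau' \mathrel{\#} \mathrm{Wr}(T_2) \cup S$. Moreover, since $\sigma' <: \Psi, l'{:}\tau' \mathrel{\#} \mathrm{Wr}(T_2) \cup S$,
by induction we have $\sigma', T_2 \curvearrowright \sigma'', T_2'$ and $\sigma'' <: \Psi, l'{:}\tau', l{:}\tau \mathrel{\#} S$.
Hence we may derive

$$\frac{\sigma, T_1 \curvearrowright \sigma', T_1' \qquad \sigma', T_2 \curvearrowright \sigma'', T_2'}{\sigma, T_1; T_2 \curvearrowright \sigma'', T_1'; T_2'}$$

and also we have $\sigma'' <: \Psi, l{:}\tau \mathrel{\#} S$ as desired.
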
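• If the derivation is of the form

$$\frac{\Psi(l') = \mathsf{bool} \qquad \Psi \vdash T \triangleright l : \tau \qquad \Psi \vdash e_t : \tau \qquad \Psi \vdash e_f : \tau}{\Psi \vdash \mathrm{cond}_l(l', b, T)^{e_t}_{e_f} \triangleright l : \tau}$$

then we must have $\sigma(l') \in \mathbb{B}$. There are two cases. Suppose
$\sigma(l') = b$. Then by induction we have that $\sigma, T \curvearrowright \sigma', T'$ and
$\sigma' <: \Psi, l{:}\tau \mathrel{\#} S$. We can conclude

$$\frac{\sigma(l') = b \qquad \sigma, T \curvearrowright \sigma', T'}{\sigma, \mathrm{cond}_l(l', b, T)^{e_t}_{e_f} \curvearrowright \sigma', \mathrm{cond}_l(l', b, T')^{e_t}_{e_f}}$$

Otherwise, $\sigma(l') = b' \neq b$. So using Theorem 11 we have
$\sigma', T'$ such that $\sigma, l \Leftarrow e_{b'} \Downarrow \sigma', T'$ and $\sigma' <: \Psi, l{:}\tau \mathrel{\#} S$, so
we may conclude

$$\frac{\sigma(l') = b' \neq b \qquad \sigma, l \Leftarrow e_{b'} \Downarrow \sigma', T'}{\sigma, \mathrm{cond}_l(l', b, T)^{e_t}_{e_f} \curvearrowright \sigma', \mathrm{cond}_l(l', b', T')^{e_t}_{e_f}}$$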

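• If the derivation is of the form

$$\frac{\Psi(l') = \{\tau'\} \qquad \Psi \vdash \tau' \triangleright \Theta \triangleright \{\tau\} \qquad \Psi, x{:}\tau' \vdash e : \{\tau\}}{\Psi \vdash l \leftarrow \mathrm{comp}(l', \Theta)\,x.e \triangleright l : \{\tau\}}$$

then for $L = \sigma(l')$, since $\Psi \vdash_{\mathrm{con}} L : \{\tau'\}$ we have
$\sigma <: \Psi, L{:}\tau' \mathrel{\#} \mathrm{Wr}(\Theta) \cup S$. Hence by induction we
have $\sigma', L', \Theta'$ such that $\sigma, x \in \sigma(l'), e, \Theta \curvearrowright^* \sigma', L', \Theta'$ and
$\sigma' <: \Psi, L'{:}\{\tau\} \mathrel{\#} S$. Therefore, $\bigcup \sigma'[L']$ is well-defined so
we can conclude

$$\frac{\sigma, x \in \sigma(l'), e, \Theta \curvearrowright^* \sigma', L', \Theta'}{\sigma, l \leftarrow \mathrm{comp}(l', \Theta)\,x.e \curvearrowright \sigma'[l := \bigcup \sigma'[L']], l \leftarrow \mathrm{comp}(l', \Theta')\,x.e}$$
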
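• If the derivation is of the form

$$\frac{\Psi(l') = \{\tau'\} \qquad \Psi \vdash \tau' \triangleright \Theta \triangleright \mathsf{int} \qquad \Psi, x{:}\tau' \vdash e : \mathsf{int}}{\Psi \vdash l \leftarrow \mathrm{sum}(l', \Theta)\,x.e \triangleright l : \mathsf{int}}$$

then the reasoning is similar to the previous case.

For part (2), the proof is by induction on the structure of $L$.

• If $L = \emptyset$, then we can simply conclude

$$\sigma, x \in \emptyset, e, \Theta \curvearrowright^* \sigma, \emptyset, \emptyset$$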

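• If $L = \{l : m\}$ then there are two cases. If $[l]T \in \Theta$
for some $T$, then we proceed as follows. Let $l' = \mathrm{out}(T)$.
By Lemma 3 we have that $\Psi, l{:}\tau \vdash T \triangleright l' : \tau'$. So,
by induction hypothesis (1), we have $\sigma, T \curvearrowright \sigma', T'$ where
$\sigma' <: \Psi, l'{:}\tau' \mathrel{\#} S$. To conclude, we derive:

$$\frac{[l]T \in \Theta \qquad \sigma, T \curvearrowright \sigma', T'}{\sigma, x \in \{l : m\}, e, \Theta \curvearrowright^* \sigma', \{l' : m\}, \{[l]T' : m\}}$$

Otherwise, $l \notin \mathrm{in}^*(\Theta)$, so we fall back on traced evaluation.
Choose $l'$ fresh for $l$, $\sigma$ and $S$. Since $\sigma <: \Psi, l{:}\tau \mathrel{\#} S$,
by Theorem 11 we can obtain $\sigma, l' \Leftarrow e[l/x] \Downarrow \sigma', T'$ where
$\sigma' <: \Psi, l'{:}\tau' \mathrel{\#} S$. To conclude we derive

$$\frac{l \notin \mathrm{in}^*(\Theta) \qquad l' \text{ fresh} \qquad \sigma, l' \Leftarrow e[l/x] \Downarrow \sigma', T'}{\sigma, x \in \{l : m\}, e, \Theta \curvearrowright^* \sigma', \{l' : m\}, \{[l]T' : m\}}$$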

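• If $L = L_1 \oplus L_2$, then clearly, $\sigma <: \Psi, L_1{:}\tau \mathrel{\#} \mathrm{Wr}(\Theta) \cup S$
so by induction we have $\sigma, x \in L_1, e, \Theta \curvearrowright^* \sigma_1, L_1', \Theta_1$ where
$\sigma_1 <: \Psi, L_1'{:}\tau' \mathrel{\#} S$. Similarly, we have $\sigma, x \in L_2, e, \Theta \curvearrowright^* \sigma_2, L_2', \Theta_2$
where $\sigma_2 <: \Psi, L_2'{:}\tau' \mathrel{\#} (\mathrm{dom}(\sigma_1) - \mathrm{dom}(\sigma)) \cup S$.
Hence, $\sigma_1$ and $\sigma_2$ are orthogonal extensions of $\sigma$, so $\sigma_1 \uplus_{\sigma} \sigma_2$
exists and $\sigma_1 \uplus_{\sigma} \sigma_2 <: \Psi, L_1' \oplus L_2'{:}\tau' \mathrel{\#} S$. We conclude by
deriving:

$$\frac{\sigma, x \in L_1, e, \Theta \curvearrowright^* \sigma_1, L_1', \Theta_1 \qquad \sigma, x \in L_2, e, \Theta \curvearrowright^* \sigma_2, L_2', \Theta_2}{\sigma, x \in L_1 \oplus L_2, e, \Theta \curvearrowright^* \sigma_1 \uplus_{\sigma} \sigma_2, L_1' \oplus L_2', \Theta_1 \oplus \Theta_2}$$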

□ 

By combining the above partial fidelity and soundness theo- 
rems, we can finally obtain our main result: 

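Corollary 1 (Total Fidelity). Suppose $\sigma_1, l \Leftarrow e \Downarrow \sigma_1', T_1$ where
$\sigma_1 : \Psi$ and $\Psi \vdash e : \tau$ and suppose $\sigma_2 <: \Psi \mathrel{\#} \mathrm{Wr}(T_1)$. Then there
exist $\sigma_2', T_2$ such that $\sigma_2, T_1 \curvearrowright \sigma_2', T_2$ and $\sigma_2, l \Leftarrow e \Downarrow \sigma_2', T_2$.

Proof. By Theorem 12 we have that $\Psi \vdash T_1 \triangleright l : \tau$. Thus, by
Theorem 13 there must exist $T_2, \sigma_2'$ such that $\sigma_2, T_1 \curvearrowright \sigma_2', T_2$. By
Theorem 10 it follows that $\sigma_2, l \Leftarrow e \Downarrow \sigma_2', T_2$. □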
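To make the statement concrete, the following Python sketch implements traced evaluation and adaptation for a deliberately tiny fragment (constants, addition of two labels, let, and conditionals) and checks fidelity on one example. Everything here is an illustrative assumption rather than the semantics defined in this paper: the tuple encoding of expressions and traces, the helper names (`make_fresh`, `subst`, `op`, `evaluate`, `adapt`), and the convention that fresh labels have the form `_tN` and never clash with input labels. Collections, comprehensions and typing are omitted.

```python
import itertools


def make_fresh():
    """Supplier of fresh labels; assumes inputs never use names like _t0."""
    counter = itertools.count()
    return lambda: f"_t{next(counter)}"


def subst(e, x, label):
    """e[label/x] for the toy syntax: const, add, let, if."""
    kind = e[0]
    if kind == "const":
        return e
    if kind == "add":
        return ("add",) + tuple(label if n == x else n for n in e[1:])
    if kind == "if":
        test = label if e[1] == x else e[1]
        return ("if", test, subst(e[2], x, label), subst(e[3], x, label))
    _, y, e1, e2 = e  # let y = e1 in e2; y shadows x inside e2
    return ("let", y, subst(e1, x, label), e2 if y == x else subst(e2, x, label))


def op(term, store):
    """Value of a term (constant or sum of two labels) in a store."""
    return term[1] if term[0] == "const" else store[term[1]] + store[term[2]]


def evaluate(store, l, e, fresh):
    """Traced evaluation: returns (new store, trace) with the result at l."""
    kind = e[0]
    if kind in ("const", "add"):
        return {**store, l: op(e, store)}, ("assign", l, e)
    if kind == "let":
        _, x, e1, e2 = e
        tmp = fresh()
        s1, t1 = evaluate(store, tmp, e1, fresh)
        s2, t2 = evaluate(s1, l, subst(e2, x, tmp), fresh)
        return s2, ("seq", t1, t2)
    _, test, e_t, e_f = e
    b = store[test]
    s, t = evaluate(store, l, e_t if b else e_f, fresh)
    return s, ("cond", l, test, b, t, e_t, e_f)


def adapt(store, trace, fresh):
    """Adaptation: replay the trace on a new store, re-evaluating a branch
    from scratch when the recorded test outcome no longer holds."""
    kind = trace[0]
    if kind == "assign":
        _, l, term = trace
        return {**store, l: op(term, store)}, trace
    if kind == "seq":
        s1, t1 = adapt(store, trace[1], fresh)
        s2, t2 = adapt(s1, trace[2], fresh)
        return s2, ("seq", t1, t2)
    _, l, test, b, body, e_t, e_f = trace
    b2 = store[test]
    if b2 == b:
        s, t = adapt(store, body, fresh)
    else:
        s, t = evaluate(store, l, e_t if b2 else e_f, fresh)
    return s, ("cond", l, test, b2, t, e_t, e_f)


# let x = a + b in if c then x + x else x + a
query = ("let", "x", ("add", "a", "b"),
         ("if", "c", ("add", "x", "x"), ("add", "x", "a")))

fresh = make_fresh()
out1, trace1 = evaluate({"a": 1, "b": 2, "c": True}, "l", query, fresh)

store2 = {"a": 10, "b": 5, "c": False}   # avoids the labels written by trace1
adapted_store, adapted_trace = adapt(store2, trace1, fresh)
scratch_store, scratch_trace = evaluate(store2, "l", query, make_fresh())
```

In this run the changed input flips the conditional, so adaptation re-evaluates the else branch; the adapted result at `l` agrees with evaluation from scratch on `store2`. In general the two traces need only agree up to the choice of fresh labels, which is why the theorem is stated as the existence of a derivation rather than as an equality of outputs.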

6. Trace slicing 

As noted above, traces are often large. Traces are also difficult to 
interpret because they reduce computations to very basic steps, like 
machine code. In this section, we consider slicing and other sim- 
plifications for making trace information more useful and readable. 
However, formalizing these techniques appears nontrivial, and is 
beyond the scope of this paper. Here we only consider examples of 
trace slicing and simplification techniques that discard some of the 
details of the trace information to make it more readable. 

Example 8 Recall query $Q_1$. If we are only interested in how row
$l_1$ in the output was computed, then the following backwards trace
slice answers this question.

    l <- comp(r,{
      [r1] x11 <- proj_C(r1,r13); x1 <- comp(s,{
        [s3] x131 <- proj_C(s3,s31); x132 <- x11 = x131;
             cond(x132,t,l11 <- proj_A(r1,r11);
                         l12 <- proj_B(r1,r12);
                         l13 <- proj_D(s3,s32);
                         l1 <- (A:l11,B:l12,D:l13);
                         x136 <- {l1})})})

Note that the slice refers only to the rows $r_1$ and $s_3$ that contribute
to the semiring-provenance of $l_1$. Moreover, the where-provenance
and dependency-provenance of $l_1$, $l_{11}$, $l_{12}$, and $l_{13}$ can be extracted
from this slice.

To make the slice more readable, we can discard information
about projection and assignment steps and substitute expressions
for labels:

    l <- comp(r,{
      [r1] x1 <- comp(s,{
        [s3] cond(r13 = s31,t,l1 <- (A:r11,B:r12,D:s32);
                              x136 <- {l1})})})

We can further simplify this to an expression $\{(A : r_{11}, B : r_{12}, D : s_{32})\}$
that shows how to calculate $l_1$ from the original
input, but this is not guaranteed to be valid if the input is changed.

Example 9 In query $Q_2$, if we are only interested in the value 7
labeled by $l_{12}'$, its (simplified) backwards trace slice is:

    l12' <- sum(s,{[s1] cond(s11 = 2, t, x13 <- s12),
                   [s2] cond(s12 = 2, t, x23 <- s22),
                   [s3] cond(s13 = 2, f, x33 <- 0)});

and from this we can extract an expression such as $s_{12} + s_{22}$ that
describes how the result was computed.
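The extraction step in Example 9 can be sketched in a few lines of Python. This is only an informal illustration, since trace slicing is not formalized here: the list `slice_entries` is a hypothetical encoding of the simplified slice above as (row, recorded test outcome, summand) triples, and `summary_expression` is a name introduced for this sketch.

```python
# Hypothetical encoding of the simplified slice for l12':
# (input row, recorded outcome of the test, summand contributed by that row)
slice_entries = [
    ("s1", True, "s12"),
    ("s2", True, "s22"),
    ("s3", False, "0"),
]


def summary_expression(entries):
    """Keep the summands of rows whose test was recorded as true."""
    terms = [summand for _row, taken, summand in entries if taken]
    return " + ".join(terms) if terms else "0"
```

Here `summary_expression(slice_entries)` yields `s12 + s22`. As with the simplified expression in Example 8, such a summary describes only the recorded run; it need not remain valid if a changed input flips one of the tests.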

7. Related and future work 

Provenance has been studied for database queries under various 
names, including "source tagging" and "lineage". We have al- 
ready discussed where-provenance, dependency provenance and 
the semiring model. Wang and Madnick (1990) described an early
provenance semantics meant to capture the original and intermediate
sources of data in the result of a query. Cui, Widom and
Wiener defined lineage, which aims to identify source data relevant
to part of the output. Buneman et al. (2001) also introduced why-provenance,
which attempts to highlight parts of the input that explain
why a part of the output is the way it is. As discussed earlier,
lineage and why-provenance are instances of the semiring model.
Recently, Benjelloun et al. (2006) have studied a new form of lineage
in the Trio system. According to Green (personal communication),
Trio's lineage model is also an instance of the semiring
model, so it can also be extracted from traces.

Buneman et al. (2006) and Buneman et al. (2007) investigated
provenance for database updates, an important scenario because 
many scientific databases are curated, or maintained via frequent 
manual updates. Provenance is essential for evaluating the scientific 
value of curated databases (Buneman et al. 2008). We have not 
considered traces for update languages in this paper. This is an 
important direction for future work. 

Provenance has also been studied in the context of (scientific) 
workflows, that is, high-level visual programming languages and 
systems developed recently as interfaces to complex distributed 
Grid computation. Techniques for workflow provenance are surveyed
by Bose and Frew (2005) and Simmhan et al. (2005). Most
such systems essentially record call graphs including the names
and parameters of macroscopic computation steps, input and output
filenames, and other system metadata such as architecture, operating
system and library versions. Similarly, provenance-aware
storage systems (Muniswamy-Reddy et al. 2006) record high-level
trace information about files and processes, such as the files read
and written by a process.

To our knowledge formal semantics have not been developed for 
most workflow systems that provide provenance tracking. Many of 
them involve concurrency, so defining their semantics may be nontrivial.
One well-specified approach is the NRC-based "dataflow"
model of Hidders et al. (2007), who define an instrumented semantics
that records "runs" and consider extracting provenance from
runs. However, their formalization is incomplete and does not ex- 
amine semantic correctness properties comparable to consistency 
and fidelity; moreover, they have not established the exact relation- 
ship between their runs and existing forms of provenance. 

As discussed in the introduction, provenance traces are related 
to the traces used in the adaptive functional programming language 
AFL (Acar et al. 2006). The main difference is that AFL traces are
meant to model efficient self-adjusting computation implementa- 
tions, whereas provenance traces are intended as a model of execu- 
tion history that can be used to answer high-level queries compa- 
rable to other provenance models. Nevertheless, efficiency is obvi- 
ously an important issue for provenance-tracking techniques. The 
problem of efficiently recomputing query results after the input 
changes, also called view maintenance, has been studied exten- 
sively for materialized views (cached query results) in relational 



databases (Gupta and Mumick 1995). View maintenance does not
appear to have been studied in general for NRC, but provenance 
traces may provide a starting point for doing so. View maintenance 
in the presence of provenance seems to be an open problem. 

Provenance traces may also be useful in studying the view up- 
date problem for NRC queries, that is, the problem of updating 
the input of a query to accommodate a desired change to the out- 
put. This is closely related to bidirectional computation techniques 
that have been developed for XML trees (Foster et al. 2007), flat
relational queries (Bohannon et al. 2006), simple functional programs
(Matsuda et al. 2007), and text processing (Bohannon et al.
2008). Provenance-like metadata has already been found useful in
some of this work. Thus, we believe that it will be worthwhile to 
further study the relationship between provenance traces and bidi- 
rectional computation. 

There is a large body of related work on dynamic analysis 
techniques, including slicing, debugging, justification, informa- 
tion flow, dependence tracking, and profiling techniques, in which 
execution traces play an essential role. We cannot give a comprehensive
overview of this work here, but refer to (Venkatesh
1991; Arora et al. 1993; Abadi et al. 1996; Field and Tip 1998;
Abadi et al. 1999; Ochoa et al. 2004) as sources we found useful
for inspiration. However, to our knowledge, none of these techniques
has been studied in the context of database query languages,
and our work reported previously in (Cheney et al. 2007)
and in this paper is the first to connect any of these topics to provenance.

Trace semantics is also employed in static analysis; in particular,
see (Rival and Mauborgne 2007). Cheney et al. (2007) defined
a type-and-effect-style static analysis for dependency provenance; 
to our knowledge, there is no other prior work on using static anal- 
ysis to approximate provenance or optimize dynamic provenance 
tracking. 

8. Conclusions 

Provenance is an important topic in a variety of settings, partic- 
ularly where computer systems such as databases are being used 
in new ways for scientific research. The semantic foundations of 
provenance, however, are not well understood. This makes it diffi- 
cult to judge the correctness and effectiveness of existing proposals 
and to study their strengths and weaknesses. 

This paper develops a foundational approach based on prove- 
nance traces, which can be viewed as explanations of the opera- 
tional behavior of a query not on just the current input but also on 
other possible (well-defined) inputs. We define and give traced op- 
erational semantics and adaptation semantics for traces and prove 
consistency and fidelity properties that characterize precisely how 
traces produced by our approach record the run-time behavior of 
queries. The proof of fidelity, in particular, involves subtleties not 
evident in other trace semantics systems such as AFL (Acar et al.
2006) due to the presence of collection types and comprehensions,
which are characteristic of database query languages. 

Provenance traces are very general, as illustrated by the fact 
that other forms of provenance information may be extracted from 
them. For instance, we show how to extract where-provenance, de- 
pendency provenance, and semiring provenance from traces. De- 
pending on the needs of the application, these specialized forms 
of provenance may be preferable to provenance traces due to ef- 
ficiency concerns. As a further application, we informally discuss 
how we may slice or simplify traces to extract smaller traces that 
are more relevant to part of the input or output. 

To our knowledge, our work is the first to formally investi- 
gate trace semantics for collection types or database query lan- 
guages and the first to relate traces to other models of provenance 
in databases. There are a number of compelling directions for future
work, including formalizing interesting definitions of trace
slices, developing efficient techniques for generating and querying
provenance traces, and relating provenance traces to the view-maintenance
and view-update problems.